PDF  PubReader

Park , Kang , and Joo: A Learning-based Distributed Algorithm for Scheduling in Multi-hop Wireless Networks

Daehyun Park , Sunjung Kang and Changhee Joo

A Learning-based Distributed Algorithm for Scheduling in Multi-hop Wireless Networks

Abstract: We address the joint problem of learning and scheduling in multi-hop wireless network without a prior knowledge on link rates. Previous scheduling algorithms need the link rate information, and learning algorithms often require a centralized entity and polynomial complexity. These become a major obstacle to develop an efficient learning-based distributed scheme for resource allocation in large-scale multi-hop networks. In this work, by incorporating with learning algorithm, we develop provably efficient scheduling scheme under packet arrival dynamics without a priori link rate information. We extend the results to distributed implementation and evaluation their performance through simulations.

Keywords: distributed algorithm , learning , multi-hop networks , provable efficiency , wireless scheduling


AS one of the key functions in wireless communication networks, link scheduling determines which links should be activated at what time. The problem is challenging, in particular, in multi-hop1 networks due to non-linear interference relationship between wireless links. The seminal work of Tassiulas and Ephremides has shown that the maximum weighted matching (MWM) algorithm that maximizes the queue weighted rate sum can achieve the optimal throughput under packet arrival dynamics [8]. Due to high computational complexity of MWM [1], alternative low-complexity scheduling solutions with comparable performance such as greedy maximal matching (GMM) or longest queue first (LQF) have attracted much attention [2], [3]. However, since GMM still has linear complexity that increases with the network size, it can be hardly used in large-size multi-hop networks.

1 In this work, we use the term ‘multi-hop’ for the arbitrary interference relationship between wireless links, and consider only single-hop traffic flows, i.e., a packet departs the network after a single transmission. If a scheduling scheme is suboptimal with these single-hop flows over multi-hop network, then clearly it is suboptimal with multi-hop flows. This approach has been widely adopted to investigate the performance of scheduling schemes without being affected by the functions of other layers such as routing and congestion control [1]–[7].

There has been extensive research on developing efficient scheduling algorithms that have sublinear complexity, yet perform provably well in multi-hop wireless networks. An approximation to GMM with logarithmic complexity has been developed in [4]. Random access technique with explicit neighborhood information exchanges has been explored at some expense of performance [5]–[7], [9]. Several studies have shown that the optimal throughput performance is achievable, either by taking the pick-and-compare approach [10], [11], or by exploiting the carrier-sensing functionality [12], [13]. There have been also attempts to develop provably efficient scheduling algorithms that work with time-varying wireless channels [4], [14] or with complex SINR interference model [15], [16]. The aforementioned scheduling schemes provide performance guarantees under packet arrival dynamics. However, they are limited to deterministic link rates that are known a priori or at the time of scheduling. Their extension to the case when the link rates are unknown is not straightforward.

In this work, we consider the scheduling problem, where link rates and statistics are unknown a priori. This occurs when new applications try to operate efficiently under uncertainty caused by wireless fading, interference, limited feedback, measurement error, system dynamics, etc [17]–[19]. We assume that an instance link rate is revealed when it is accessed/scheduled, and it is drawn from an unknown static distribution. Our goal is to find an appropriate sequence of link schedules that maximize throughput under packet arrival dynamics, while quickly learning the link rates and queue states.

Focusing on the learning aspect, the problem can be viewed as a variant of multi-armed bandit (MAB) problems, in which one repetitively plays a set of arms to maximize the reward sum [20]. The performance of a learning algorithm is often evaluated by regret, which is the difference in the total expected reward obtained by an optimal policy and that by the learning algorithm. Lai and Robbins have shown that the regret grows at least at logarithmic rate of time [21], and several index-type learning algorithms with the order-optimal regret have been developed [22], [23]. For a large-scale multi-hop wireless network, it is imperative to design algorithms that are amenable to implement in a distributed manner. In [24], the authors have developed a distributed learning algorithm that selects best M out of N arms, where each of M users selects an arm taking into consideration mutual collision. By employing a time-division selection approach, the scheme is shown to achieve logarithmic regret. Chen et al. [20] and Gai et al. [25] have considered more general problems of combi- natorial MAB (CMAB) with arbitrary constraint. They have employed an [TeX:] $$(\alpha, \beta)$$ approximation oracle that can achieve [TeX:] $$\alpha$$ fraction of the optimal value with probability [TeX:] $$\beta,$$ and developed learning schemes that can achieve the logarithmic growth for [TeX:] $$\alpha \beta$$ fraction of the optimal expected regret (denoted by [TeX:] $$\alpha \beta -$$ regret). However, an oracle with good [TeX:] $$\alpha \beta$$ (i.e., close to 1) often has a high-order polynomial complexity, and thus as the network scales, it is not clear whether the scheme is amenable to implement in a distributed manner.

Applying learning algorithms to scheduling in wireless networks, the works of [26], [27] have addressed the regretminimization problem in cognitive radio network settings, and developed distributed schemes with logarithmic regret through prioritized ranking and adaptive randomization. The authors of [28] have developed fully distributed schemes that can achieve the logarithmic regret without any information exchange. In [29], the authors have successfully lowered the algorithmic complexity to O(1) while achieving the logarithmic regret performance. Although these aforementioned learning algorithms are amenable to distributed implementation, they are limited to single-hop networks, such as wireless access networks, and to saturated traffic scenarios (i.e., links always have packets to send), and thus cannot accordingly respond to packet arrival dynamics. Recently, Stahlbuhk et al. [19] has developed a joint learning and scheduling scheme that provides provable efficiency under packet arrival dynamics, by incorporating GMM scheduling algorithm and UCB-based learning algorithm. Albeit interesting, it achieves only 1\2 of the capacity region and has linear complexity, which makes it less attractive for large-size networks.

In this work, we consider the joint problem of learning and scheduling in multi-hop wireless networks, and develop low-complexity schemes that achieves near-full capacity region under packet arrival dynamics. Our contribution can be summarized as follows:

We develop a joint learning and scheduling scheme with O(k) computational complexity, by successfully incorporating a graph augmentation algorithm with the UCB index. Parameter k is settable for a certain level of performance, in which case the complexity becomes O(1).

We show that [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ is an [TeX:] $$(\alpha, \beta)$$ )-approximation oracle with [TeX:] $$\alpha=\frac{k-1}{k+1}$$ and some small non-zero [TeX:] $$\beta,$$ but can indeed achieve the logarithmic growth for [TeX:] $$\frac{k-1}{k+1}$$ -regret, regardless of the value of [TeX:] $$\beta,$$ which is in contrast to [TeX:] $$\alpha \beta$$ -regret shown in [20], [25].

By using the frame structure, we show that [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ achieves [TeX:] $$\frac{k-1}{k+1}$$ fraction of the capacity region under packet arrival dynamics.

We extend our result and develop [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB}$$ scheme that is amenable to implement in a completely distributed manner.

The rest of paper is organized as follows. Section II describes our system model. We propose a joint scheme of learning and scheduling in Section III, and analytically evaluate its performance in Section IV. We further extend our scheme for distributed implementation in Section V. Finally, we numerically evaluate our schemes in Section VI and conclude our work in Section VII.


We consider a multi-hop wireless network denoted by graph [TeX:] $$\mathcal{G}=(\mathcal{V}, \mathcal{L})$$ with the set [TeX:] $$\mathcal{V}$$ of nodes and the set [TeX:] $$\mathcal{L}$$ [TeX:] $$$$ of directional links. We assume that the connectivity is reciprocal, i.e., if [TeX:] $$(u, v) \in \mathcal{L}, \text { then }(v, u) \in \mathcal{L}.$$ A set of links that can be scheduled at the same time is constrained by the primary interference model, under which any node v (either transmitter or receiver) in the network can communicate with at most one of its neighbor nodes [TeX:] $$\mathcal{N}(v).$$, where [TeX:] $$\mathcal{N}(v)=\{u \in \mathcal{V} \mid(v, u) \in \{\mathcal{J}.$$ Slightly abusing the notation, we also denote the set of links that is connected to v by [TeX:] $$\mathcal{N}(v).$$ The primary interference model can represent Bluetooth or FH-CDMA networks as well as capture the essential feature of wireless interference [2], [19], and has been adopted in many studies on wireless scheduling, e.g., see [2]–[7] for more detailed description. Time is slotted, which can be achieved by being equipped with high accuracy GPS. At each time slot, a set of links that satisfies the interference constraints can be simultaneously activated. Such a set of links is called a matching (or a feasible schedule), and let [TeX:] $$\mathcal{S}$$ denote the set of all matchings.

At each link [TeX:] $$i \in \mathcal{L},$$ we assume packets arrive following a Bernoulli process with probability [TeX:] $$\lambda_{i}$$ (e.g., see [11]). Let [TeX:] $$\lambda$$ denote its vector and [TeX:] $$a_{i}(t) \in\{0,1\}$$ denote the number of arrived packets in time slot t. We have [TeX:] $$\mathbb{E}\left[a_{i}(t)\right]=\lambda_{i}.$$ We assume that the rate of link i is time-varying due to multipath fading and unknown interference as in [14], and it is independently drawn from a (possibly different) distribution with mean [TeX:] $$\mu_{i}.$$ Let [TeX:] $$\mu$$ denote its vector and [TeX:] $$X_{i}(t) \in[0,1]$$ denote the instance rate of link i when it is activated at time slot t, with [TeX:] $$\mathbb{E}\left[X_{i}(t)\right]=\mu_{i}.$$ The extension to multiple packet arrivals and departures is straightforward. We assume that [TeX:] $$\lambda$$ and [TeX:] $$\mu$$ are unknown.

At time slot t, if a policy activates matching [TeX:] $$\mathbf{S}_{t},$$ then each link [TeX:] $$i \in \mathbf{S}_{t}$$ accesses the medium and transmits [TeX:] $$X_{i}(t)$$ packets2 during the time slot. Each link i is associated with an unbounded buffer that queues up packets for transmission. Let [TeX:] $$q_{i}(t)$$ denote the queue length at link i at the beginning of time slot t, which evolves as

2 2or transmits 1 packet with success probability [TeX:] $$X_{i}(t).$$

[TeX:] $$q_{i}(t+1)= \begin{cases}{\left[q_{i}(t)-X_{i}(t)\right]^{+}+a_{i}(t),} & \text { if } i \in \mathbf{S}_{t}, \\ q_{i}(t)+a_{i}(t), & \text { if } i \notin \mathbf{S}_{t},\end{cases}$$

where [TeX:] $$[\cdot]^{+}=\max \{\cdot, 0\}.$$ Let q(t) denote its vector, and let [TeX:] $$q^{*}(t)=\max _{i \in \mathcal{L}} q_{i}(t)$$ denote the maximum queue length in the network at time slot t.

We consider a frame structure where each frame has length of T time slots. Frame n begins at time slot [TeX:] $$t_{n}=(n-1) T+1.$$ During frame n, i.e., for time slots [TeX:] $$t \in\left[t_{n}, t_{n+1}\right),$$ we define weight [TeX:] $$W_{i}(t)$$ and its mean [TeX:] $$w_{i}$$ of link i, respectively, as

[TeX:] $$W_{i}(t)=\frac{q_{i}\left(t_{n}\right)}{q^{*}\left(t_{n}\right)} X_{i}(t), \text { and } w_{i}=\frac{q_{i}\left(t_{n}\right)}{q^{*}\left(t_{n}\right)} \mu_{i}.$$

For [TeX:] $$q^{*}\left(t_{n}\right)=0,$$ we define [TeX:] $$W_{i}(t)=X_{i}(t) \text { and } w_{i}=\mu_{i}.$$ Let w denote its vector [TeX:] $$\left(w_{1}, w_{2}, \cdots, w_{|\mathcal{L}|}\right),$$ where [TeX:] $$|\cdot|$$ is the cardinality of the set. We denote the link weight sum of matching S by

[TeX:] $$r_{\boldsymbol{w}}(S)=\sum_{i \in S} w_{i}.$$

For convenience, we let [TeX:] $$r_{\boldsymbol{w}}^{*}=\max _{S \in \mathcal{S}} r_{\boldsymbol{w}}(S)$$ denote the largest weight sum over all matchings, and we also denote a set of optimal matchings by [TeX:] $$\mathcal{S}_{w}^{*}=\arg \max _{S \in \mathcal{S}} r_{\boldsymbol{w}}(S).$$ For [TeX:] $$\alpha \in(0,1],$$ we define a set of near-optimal matchings with respect to vector w as

[TeX:] $$\mathcal{S}_{\boldsymbol{w}}^{\alpha}=\left\{S \in \mathcal{S} \mid r_{\boldsymbol{w}}(S) \geq \alpha \cdot r_{\boldsymbol{w}}^{*}\right\},$$

and define its complement as [TeX:] $$\overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}=\mathcal{S}-\mathcal{S}_{\boldsymbol{w}}^{\alpha}.$$

In the CMAB framework, a link corresponds to an arm, a matching to a super arm, and the instance link rate to the reward of the link, respectively. We use the terms interchangeably. Note that the regret is defined as the accumulated expected difference between the reward sum associated with an optimal matching and that obtained by the MAB algorithm. Similar to [20], we define -regret as, for some [TeX:] $$\alpha \in(0,1],$$

[TeX:] $$\operatorname{Reg}^{\alpha}(t)=t \cdot \alpha \cdot r_{\boldsymbol{w}}^{*}-\mathbb{E}\left[\sum_{\tau=1}^{t} r_{\boldsymbol{w}}\left(\mathbf{S}_{\tau}\right)\right],$$

which evaluates the performance of an CMAB task at time t.

In the viewpoint of resource allocation, achieving a high reward sum is equivalent to achieving a larger queue-weighted link rate sum, which implies that the links with high demands and high service rates are scheduled first, and thus tends to stabilize the network. A network is said to be stable if the queues of all links are rate stable, i.e., [TeX:] $$\lim _{t \rightarrow \infty} q_{i}(t) / t=0$$ with probability 1 for all [TeX:] $$i \in \mathcal{L} . \text { Let } \Lambda$$ denote the capacity region, which is the set of arrival rate vectors [TeX:] $$\lambda$$ such that for any [TeX:] $$\lambda \in \Lambda,$$ there exists a policy that can make the network stable. We say that a scheduling policy has the stability region [TeX:] $$\gamma \Lambda$$ for some [TeX:] $$\gamma \in[0,1] \text {, },$$ if it can stabilize the networks for any arrival [TeX:] $$\lambda \in \gamma \Lambda.$$

We aim to develop a joint scheme of learning and scheduling that determines [TeX:] $$\mathrm{S}_{t}$$ to

[TeX:] $$maximize $\gamma$ \\ subject to $\lim _{t \rightarrow \infty} \frac{q_{i}(t)}{t}=0$, for any $\lambda \in \gamma \Lambda$, \\ X_{i}(t) \stackrel{\text { iid }}{\sim} \mathcal{D}_{i}, \text { and (1) },$$

where [TeX:] $$\mathcal{D}_{i}$$ denotes the distribution with finite support [0, 1]. The arrival vector [TeX:] $$\lambda$$ and the distributions [TeX:] $$\left\{\mathcal{D}_{i}\right\}$$ are unknown to the controller, and an observation on [TeX:] $$X_{i}(t)$$ is available only by scheduling link i at time t.

III. AUGMENTATION WITH [TeX:] $$\text { UCB INDEX (A } \left.^{k}-\mathrm{UCB}\right)$$

We develop a provably efficient joint scheme of learning and scheduling, by incorporating the augmentation algorithm presented in [11] with UCB index. We first describe the overall algorithm, and then explain the detailed operations.

We consider each frame time as an independent learning period, which allows us to decouple the learning from the scheduling. In the following, we describe the operation of our algorithm during a frame time. For the ease of exposition, we

Algorithm 1
Frame-based joint learning and scheduling

assume that our algorithm runs for time slot [1, T], where T is the frame length.

We start with some notations. Let [TeX:] $$q_{i}=q_{i}(1)$$ denote the initial (i.e., at the beginning of the frame) queue length of link (arm) i, and let [TeX:] $$q^{*}=\max _{i} q_{i} . \text { Let } \hat{\tau}_{i}(t)$$ denote the number of times that arm i is played up to time slot t, and let [TeX:] $$\hat{\tau}_{S}(t)$$ denote the number of times that matching S is played. The UCB index of arm i [22] is defined as

[TeX:] $$\bar{w}_{i, t}=\hat{w}_{i}(t-1)+\sqrt{\frac{(|\mathcal{L}|+1) \ln t}{\hat{\tau}_{i}(t-1)}},$$

where [TeX:] $$\hat{w}_{i}(t)=\frac{q_{i}}{q^{*}} \cdot \frac{1}{\hat{\tau}_{i}(t)} \sum_{n=1}^{t} X_{i}(n) \cdot \mathbb{I}\left\{i \in S_{n}\right\}$$ denotes average reward of arm i at time slot t weighted by [TeX:] $$\frac{q_{i}}{q^{*}}$$ and [TeX:] $$\mathbb{I}\{e\} \in\{0,1\}$$ denotes the indicator function that equals 1 if event e occurs, and 0 otherwise. All the variables of [TeX:] $$\bar{w}_{i, t}, \hat{w}_{i}, \hat{\tau}_{i}, q_{i}, q^{*}$$ will be reset at the beginning of each frame. Let [TeX:] $$\overline{\boldsymbol{w}}_{t}=\left(\bar{w}_{1, t}, \bar{w}_{2, t}, \cdots, \bar{w}_{|\mathcal{L}|, t}\right)$$ denotes the UCB index vector. Then [TeX:] $$r_{\overline{\boldsymbol{w}}_{t}}(S) \text { and } r_{\boldsymbol{w}_{t}}^{*}$$ are the index sum over links in matching S and its maximum value over all possible matchings, respectively. Also, we denote [TeX:] $$\mathcal{S}_{\bar{w}_{t}}^{*} \text { and } \mathcal{S}_{\bar{w}_{t}}^{\alpha}$$ as the set of matchings that achieve [TeX:] $$r_{\bar{w}_{t}}^{*}$$ nd those that achieve at least [TeX:] $$\alpha r_{\overline{\boldsymbol{w}}_{t}}^{*},$$ respectively.

We develop our joint learning and scheduling algorithm based on the generic UCB index, as shown in Algorithm 1, where we omit subscript t of [TeX:] $$\hat{w}_{i}(t) \text { and } \hat{\tau}_{i}(t)$$ for brevity. At time slot t = 1, it obtains two constant weight parameters [TeX:] $$q_{i}$$ and q* (line 1), and schedules arbitrary matchings for [TeX:] $$|\mathcal{L}|$$ time slots such that each link can be scheduled at least once (lines 3-5). Afterwards, it computes the index of each arm (line 7), selects matching [TeX:] $$\mathrm{S}_{t}$$ using the indices (line 8), and schedules it (lines 9-10).

The key part of the algorithm is about how to select matching [TeX:] $$\mathbf{S}_{t}$$ in line 8 such that it achieves high performance at low complexity. It has been shown in [20], [25] that, if we use an [TeX:] $$(\alpha, \beta)$$ -approximation oracle in the matching selection, the joint algorithm has [TeX:] $$\alpha \beta$$ -regret performance, which leads to achieving [TeX:] $$\alpha \beta$$ fraction of the capacity region. The problem is, that the parameter [TeX:] $$\beta$$ can be very small in particular when the network size is large. To this end, we introduce a class of augmentation algorithms with parameter k, which is an [TeX:] $$(\alpha, \beta)$$ -approximation oracle with [TeX:] $$\alpha=(k-1) /(k+1)$$ and

Fig. 1.

Time structure.

some small3 [TeX:] $$\beta>0 \text {, }$$ and has O(k) complexity.

3 3It depends on the network topology and the setting of p. In general, we have a smaller for a larger network.

We overview the augmentation algorithm. For the detailed description, we refer to [11] or our technical report [30]. We start with some definitions. Given a matching S, an augmentation A of matching S is a path or cycle where every alternate link is in S and has the property that if all links in [TeX:] $$A \cap S$$ are removed from S and all links in A - S are added to that S, then the resulting set of links is another matching in G. The latter process of finding new matching is called augmenting S with A, and the resulting matching is denoted by [TeX:] $$S \oplus A=(S-A) \cup(A-S).$$ A pair of augmentations [TeX:] $$A_{1}$$ and [TeX:] $$A_{2}$$ of matching S is disjoint if no two links in [TeX:] $$A_{1}-S$$ and [TeX:] $$A_{2}-S$$ are adjacent, i.e., if they do not share a common node. Let [TeX:] $$\mathcal{A}$$ denote a set of disjoint augmentations of matching S where every pair in [TeX:] $$\mathcal{A}$$ is disjoint. Then [TeX:] $$\bigcup_{A \in \mathcal{A}}(S \oplus A)$$ is also a matching in G.

The overall procedure of the augmentation algorithm at each time slot t is as follows.

1) At the beginning of the time slot, each link i is associated with some known weight [TeX:] $$w_{i}(t).$$

2) Given a valid matching [TeX:] $$\mathbf{S}_{t-1}$$ that is the schedule at time t - 1, it randomly generates a set[TeX:] $$\mathcal{A}$$ of disjoint augmentations of [TeX:] $$\mathrm{S}_{t-1}.$$

3) It compares the weight sum of [TeX:] $$A-\mathbf{S}_{t-1} \text { and } A \cap \mathbf{S}_{t-1}$$ for each [TeX:] $$A \in \mathcal{A}.$$ Let B(A) be the one with the larger weight sum among the two.

4) It takes the new schedule [TeX:] $$\mathbf{S}_{t} \text { as } \bigcup_{A \in \mathcal{A}} B(A).$$

For the comparison of the weight sum in the 3rd step, we define the gain of augmentation A as

[TeX:] $$G_{t}(A)=\sum_{i \in A-\mathbf{S}_{t-1}} w_{i}-\sum_{j \in A \cap \mathbf{S}_{t-1}} w_{j},$$

and obtain new schedule [TeX:] $$\mathbf{S}_{t}$$ by augmenting [TeX:] $$\mathrm{S}_{t-1}$$ with all [TeX:] $$A \in \ \mathcal{A} \text { of } G_{t}(A)>0.$$

The augmentation algorithm can accomplish the above procedure in a distributed fashion. The algorithm has two configuration parameters p and k, and consists of the following four stages in each time slot: initialization, augmenting, checking a cycle, and back-propagating/scheduling. For the ease of exposition, we consider additional time structure of mini-slots, as shown in Fig. 1, and all the four stages end in 4k + 2 mini-slots.

i) Initialization stage: At mini-slot [TeX:] $$\tau=1,$$ each node v selects itself as a seed with probability p. Once selected, it becomes an active node. It starts an augmentation Av and randomly selects [TeX:] $$\bar{Z}_{v} \in[1, k],$$ which is the maximum size4> of [TeX:] $$A_{v}.$$ Then, it adds the first link to [TeX:] $$A_{v}$$ from its neighboring links [TeX:] $$\mathcal{N}(v)$$ if there is a link in [TeX:] $$\mathbf{S}_{t-1} \cap \mathcal{N}(v),$$ then we include the link in [TeX:] $$A_{v},$$ and otherwise, we randomly choose a link from [TeX:] $$\mathcal{N}(v).$$ Once the first link (v; n) is chosen, the two nodes v and n coordinate with each other by exchanging necessary information through request (REQ) and acknowledgement (ACK) messages. The information (including current [TeX:] $$A_{v},$$ current [TeX:] $$G_{t}\left(A_{v}\right)$$ according to (7), [TeX:] $$\bar{Z}_{v},$$ etc) allows node n to continue building the augmentation. At the end of the first minislot, node v becomes inactive and node n becomes active.

4 The size Z of an augmentation A is defined as the number of new links in A, i.e., [TeX:] $$Z=\left|A-\mathbf{S}_{t-1}\right|$$

ii) Augmenting stage: It extends [TeX:] $$A_{2}$$ by adding a link at each mini-slot [TeX:] $$\tau \in[2,2 k+1].$$ Current active node selects new link that should be either a random neighboring link outside [TeX:] $$\mathbf{S}_{t-1}$$ (if the previously selected link is in [TeX:] $$\left.\mathbf{S}_{t-1}\right)$$ or a neighboring link in [TeX:] $$\mathbf{S}_{t-1}$$ (if the previously selected link is not in [TeX:] $$\left.\mathrm{S}_{t-1}\right).$$ It updates [TeX:] $$G_{t}\left(A_{v}\right) \text { and } A_{v}$$ accordingly, and proceeds a similar coordination through REQ and ACK messages, and at the end of mini-slot, active node changes to the corresponding end node of the new link. The extension continues until one of the following conditions hold: the size of [TeX:] $$A_{v} \text { equals } \bar{Z}_{v}, A_{v}$$ cannot be extended any more, or the extension fails due to message collision (see Fig. 2).

iii) Cycle-checking stage: When the augmenting stage finishes, the last node is called the terminus. The terminus checks whether the final augmentation [TeX:] $$A_{v}$$ forms a cycle or not. If it forms a cycle, the gain is updated to include the last link (n, s).

iv) Back-propagating/scheduling stage: The last stage is for back-propagating the final gain from the terminus to the seed through [TeX:] $$A_{v},$$ and in the meantime, constructing [TeX:] $$S^{*}$$ either by scheduling links outside [TeX:] $$\mathbf{S}_{t-1}$$ if the gain is positive, or by scheduling links in [TeX:] $$\mathbf{S}_{t-1}$$ if the gain is negative. This takes additional 2k +1 mini-slots at most. The final result [TeX:] $$S^{*}$$ will be used as the new schedule [TeX:] $$\mathbf{S}_{t}$$ during time slot t.

Remarks: In our description, we present [TeX:] $$\mathbf{S}_{t-1}$$ variable, but each node v indeed requires only the local view of it, i.e., [TeX:] $$\mathcal{N}(v) \cap \mathbf{S}_{t-1},$$ which can be obtained during the back-propagation in the last stage of the previous time slot. After all the stages, a set of augmentations (one per a seed node) will be generated. Since a size[TeX:] $$-\bar{Z}$$ augmentation A can have at most [TeX:] $$\bar{Z}+1$$ links of [TeX:] $$\mathbf{S}_{t-1} \text { and } \bar{Z}$$ new links, the number of total links in A can be up to [TeX:] $$2 \bar{Z}+1 \leq 2 k+1.$$

Fig. 2 illustrates an example operation of the augmentation algorithm during time slot t in a [TeX:] $$3 \times 4$$ grid topology. Nodes are dots, and solid lines are links. The previous schedule [TeX:] $$\mathbf{S}_{t-1}$$ is marked by thick solid lines in Fig. 2(a)-(e). At the beginning [TeX:] $$(\tau=1),$$ each node selects itself as a seed with probability p. In this example, three nodes are selected and become an active as marked by (white) numbered circles in Fig. 2(a). Active nodes 1 and 2 have to start its augmentation with the previous scheduled link, while active node 3 selects one of three neighboring nodes at random. The solid arrow

Fig. 2.

Example operation of the augmentation algorithm with k = 2 in a time slot. After the cycle-checking stage, link weights of final augmentations are shown in (e). After back-propagation, links are accordingly augmented with [TeX:] $$A_{1},$$ and the final schedule is marked by thick solid lines in (f).

denotes the link selected by the active node. Once selected, the nodes exchange the necessary information. In the next mini-slot ( = 2), active nodes change as shown in Fig. 2(b). Narrow dotted arrows denote the augmentation up to now. The active nodes continue to build the augmentation repeating the procedure, until the augmentation cannot be extended or it reaches the maximum size. In the meantime, if two augmentations collide as shown by active nodes 2 and 3 in Fig. 2(b), both augmentations terminate. The node at the collision point will belong to the augmentation that follows a link in [TeX:] $$\mathbf{S}_{t-1},$$ as shown in Fig. 2(c). The terminus nodes are marked by solid (black) number circles. The result after (2k+1)th mini-slots is shown in Fig. 2(e), where each of three augmentations is marked by a dotted enclosure. The number of links in the augmentations denote their weight. Then after checking a cycle, the back-propagating stage follows. Each terminus makes the final decision by comparing the weight sum as in (7). The decision propagates backward through the augmentation, and leads to new schedule [TeX:] $$\mathbf{S}_{t}$$ as shown in Fig. 2(f).

By using the aforementioned augmentation algorithm in matching selection of line 8 in Algorithm 1, we complete [TeX:] $$A^{k}$$ - UCB scheme. However, evaluating the performance of [TeX:] $$\mathrm{A}^{k}-$$ UCB is not straightforward. Since the scheduler do not know the true link rate [TeX:] $$\mu_{i}$$ and instead use the experience-based UCB index [TeX:] $$\bar{w}_{i, t},$$ the previous analysis of [11] for scheduling performance is not applicable due to inaccurate link rate information. Also for learning performance, considering [TeX:] $$\mathrm{A}^{k}-$$ UCB as an [TeX:] $$(\alpha, \beta) \text {-oracle }$$ is not much helpful due to the fact that probability [TeX:] $$\beta$$ can be arbitrarily small as the network scales up.

Our main contribution is to analytically characterize the performance of our joint learning and scheduling scheme [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB},$$ and to show that it can achieves the rate-optimal logarithmic growth of [TeX:] $$\frac{k-1}{k+1} \text {-regret }$$ regardless of the network size in the learning, and further it has the close-to-optimal stability region that equals [TeX:] $$\frac{k-1}{k+1} \Lambda$$ in the scheduling.


We first consider the regret performance of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ in a single frame, and then evaluate its scheduling performance across frames.

A. Regret Performance in a Single Frame

We show that [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ has distribution-dependent upper bound of O(log T) on the regret in a frame of length T. We start with the following lemma.

Lemma 1. Given any [TeX:] $$\mathbf{S}_{t-1},$$ weight [TeX:] $$\overline{\boldsymbol{w}}_{t},$$ and a fixed [TeX:] $$k>0 \text {, }$$ there exists [TeX:] $$\delta>0$$ such that, with probability at least [TeX:] $$\delta,$$ the augmentation algorithm generates a set [TeX:] $$\mathcal{A}^{*}$$ of disjoint augmentations that satisfies [TeX:] $$\left(\mathbf{S}_{t-1} \oplus \mathcal{A}^{*}\right) \in \mathcal{S}_{\bar{w}_{t}}^{\alpha} \text {, i.e., } \ \operatorname{Pr}\left\{\left(\mathbf{S}_{t-1} \oplus \mathcal{A}^{*}\right) \in \mathcal{S}_{\bar{w}_{t}}^{\alpha}\right\} \geq \delta,$$ or equivalently,

[TeX:] $$\operatorname{Pr}\left\{r_{\overline{\boldsymbol{w}}_{t}}\left(\mathbf{S}_{t-1} \oplus \mathcal{A}^{*}\right) \geq \frac{k-1}{k+1} \cdot r_{\boldsymbol{w}_{t}}^{*}\right\} \geq \delta,$$

where [TeX:] $$\delta \geq \min \left\{1,\left(\frac{p}{1-p}\right)^{|\mathcal{V}|}\right\} \cdot\left(\frac{1-p}{k \Sigma}\right)^{|\mathcal{V}|},|\mathcal{V}|$$ is the number of nodes, and [TeX:] $$\Sigma$$ is the maximum node degree.

Lemma 1 means that the augmentation algorithm is an [TeX:] $$(\alpha, \beta) \text { - }$$ approximation oracle with [TeX:] $$\alpha=\frac{k-1}{k+1} \text { and } \beta>0, \text { where } \beta,$$ can be arbitrarily small according to the network size. The proof follows the same line of analysis of [11] and thus omitted. For the completeness, we provide the proof in [30].

Next, we need a generalized version of the decomposition inequality for -regret. From (5), we have

[TeX:] $$\begin{aligned} \operatorname{Reg}^{\alpha}(t)=t \cdot \alpha \cdot r_{\boldsymbol{w}}^{*}-\mathbb{E}\left[\sum_{\tau=1}^{t} r_{\boldsymbol{w}}\left(\mathbf{S}_{\tau}\right)\right] \\ =\sum_{\tau=1}^{t} \sum_{S \in \mathcal{S}} \mathbb{E}\left[\mathbb{I}\left\{\mathbf{S}_{\tau}=S\right\} \cdot\left(\alpha r_{\boldsymbol{w}}^{*}-r_{\boldsymbol{w}}(S)\right)\right] \\ \leq \sum_{S \in \mathcal{S}} \mathbb{E}\left[\hat{\tau}_{S}(t)\right] \cdot \Delta_{\max }^{\alpha}, \end{aligned}$$

where [TeX:] $$\Delta_{\max }^{\alpha}=\alpha \cdot r_{w}^{*}-\min _{S \in \mathcal{S}^{\alpha}} r_{\boldsymbol{w}}(S)$$ is the maximum nearoptimal gap. Similarly we define the minimum near-optimal gap [TeX:] $$\Delta_{\min }^{\alpha}=\alpha \cdot r_{\boldsymbol{w}}^{*}-\max _{S \in \overline{\mathcal{S}}^{\alpha}} r_{\boldsymbol{w}}(S).$$

The following lemma ensures that, if a non-near-optimal matching in [TeX:] $$\overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}$$ is played many times, then its index sum is smaller than that of any near-optimal matching in [TeX:] $$\mathcal{S}_{\boldsymbol{w}}^{\alpha}.$$

Lemma 2. Given a frame of length T, if a non-near-optimal matching [TeX:] $$S \in \overline{\mathcal{S}}_{w}^{\alpha}$$ is played more than [TeX:] $$l_{T}=\left\lceil\frac{4|\mathcal{L}|^{2}(|\mathcal{L}|+1) \ln T}{\Delta_{\min }^{\alpha}}\right.$$ times by t-th time slot in the frame, then the probability that the total sum of UCB indices over S at time slot t is greater than that over any near-optimal matching [TeX:] $$S^{\prime} \in \mathcal{S}_{w}^{\alpha}$$ is bounded by

[TeX:] $$\operatorname{Pr}\left\{r_{\overline{\boldsymbol{w}}_{t}}(S) \geq r_{\overline{\boldsymbol{w}}_{t}}\left(S^{\prime}\right)\right\} \leq 2|\mathcal{L}| t^{-2},$$

for all [TeX:] $$t \leq T$$ such that [TeX:] $$\hat{\tau}_{S}(t) \geq l_{T}.$$

We emphasize that [TeX:] $$S_{w}^{\alpha} \text { and } \overline{\mathcal{S}}_{w}^{\alpha}$$ are defined with true weight w, while the matching comparison is based on UCB index [TeX:] $$\overline{\boldsymbol{w}}_{t}.$$ The lemma shows that the augmentation algorithm may still work well, even when the true weight is replaced with the UCB index. The proof of the lemma is analogous to Lemma A.1 of [29] but has some differences due to ’near-optimality’. It can be found in [30].

One of our main results, the regret bound of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB},$$ can be obtained as follows.

Proposition 1. For a network graph [TeX:] $$\mathcal{G}=(\mathcal{V}, \mathcal{L}), A^{k}-U C B$$ achieves the regret performance bound of

[TeX:] $$\operatorname{Reg}^{\alpha}(t) \leq \Delta_{\max }^{\alpha}\left[D_{1} \cdot \frac{\log t}{\left(\Delta_{\min }^{\alpha}\right)^{2}}+D_{2}\right],$$

for all [TeX:] $$t \in\{1, \cdots, T\},$$ where [TeX:] $$D_{1}=\left(1+\frac{1}{\delta}\right) \cdot 4|\mathcal{L}|^{2}(|\mathcal{L}|+1).$$ [TeX:] $$(|\mathcal{S}|-1), \quad D_{2}=\frac{|\mathcal{S}|-1}{\delta}\left(1+\frac{|\mathcal{L}| \delta \pi^{2}}{3}+\frac{|\mathcal{L}|(|\mathcal{S}|-2) \pi^{2}}{6}\right)+\frac{1-\delta}{\delta}+ \ \frac{2|\mathcal{L}| \pi^{2}}{3 \delta}, \alpha=\frac{k-1}{k+1}, \text { and } \delta=\min \left\{1,\left(\frac{p}{1-p}\right)^{|\mathcal{V}|}\right\} \cdot\left(\frac{1-p}{k \Sigma}\right)^{|\mathcal{V}|}.$$

It shows that [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ achieves the logarithmic growth O(log T) of -regret. Note that although the result is somewhat similar to those in [20], [25], their proof techniques are not applicable. Suppose that at time slot t, [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ randomly generates a set [TeX:] $$\mathcal{A}_{t}$$ of augmentations based on the previous schedule [TeX:] $$\mathbf{S}_{t-1}, \text { and } \mathcal{A}_{t}$$ consists of a single augmentation for the ease of exposition. It is possible that both the matchings, previous schedule [TeX:] $$\mathbf{S}_{t-1}$$ and schedule [TeX:] $$\mathbf{S}_{t-1} \oplus \mathcal{A}_{t}$$ generated by the augmentation algorithm, are non-near-optimal. This implies that we cannot ensure that the index sum of the chosen schedule (i.e., either [TeX:] $$\left.\mathbf{S}_{t-1} \text { or } \mathbf{S}_{t-1} \oplus \mathcal{A}_{t}\right)$$ is greater than [TeX:] $$\alpha r_{\overline{\boldsymbol{w}}_{t}}^{*},,$$ because the index-sum comparison is done only between the two non-near-optimal matchings. This, the lack of comparison with the optimal matching (or near-optimal matching in our case) at every time slot, makes the previous regret analysis technique non-applicable. We successfully address the difficulties by grouping the plays of non-nearoptimal matchings.

Overall, we show that the number of explorations to nonnear- optimal matchings is bounded. To this end, we consider a sequence of time points where a non-near-optimal matching is sufficiently played at each point. They serve as a foothold to count the total number of plays of non-near-optimal matchings.

To begin with, for an arbitrary fixed time [TeX:] $$h>0, \text { let } l_{h}= \ \left\lceil\frac{4|\mathcal{L}|^{2}(|\mathcal{L}|+1) \ln h}{\left(\Delta^{\alpha},\right)^{2}}\right\rceil, \text { and let } \hat{t}_{h}$$ denote the first time when all non-near-optimal matchings are sufficiently (i.e., more than [TeX:] $$l_{h}$$ times) explored, i.e.,

[TeX:] $$\hat{t}_{h}=\min \left\{t \mid \hat{\tau}_{S}(t) \geq l_{h} \text { for all } S \in \overline{\mathcal{S}}_{w}^{\alpha}\right\}.$$

(1) When [TeX:] $$\hat{t}_{h} \leq h: \text { Let } \overline{\mathcal{S}}_{w}^{\alpha}=\left\{S^{1}, S^{2}, \cdots, S^{M}\right\} \text { with } M= \ \left|\mathcal{S}_{\boldsymbol{w}}^{\alpha}\right|.$$ Further we define [TeX:] $$\overline{\mathcal{S}}(t)=\left\{S \in \mathcal{S}_{w}^{\alpha} \mid \hat{\tau}_{S}(t) \geq l_{h}\right\},,$$ which is the set of non-near-optimal matchings that are scheduled sufficiently many times by time t, and [TeX:] $$\underline{\mathcal{S}}(t) \ =\overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}-\overline{\mathcal{S}}(t)$$ denotes the set of not-yet-sufficiently-scheduled non-nearoptimal matchings. Also, let [TeX:] $$t^{n}$$ denote the time when matching [TeX:] $$S^{n}$$ is sufficiently scheduled, i.e., [TeX:] $$\hat{\tau}_{S^{n}}\left(t^{n}\right)=l_{h}.$$ Without loss of generality, we assume [TeX:] $$t^{1}<t^{2}<\cdots <t^{M}=\hat{t}_{h}.$$

To apply the decomposition inequality (9), we need to estimate the expected value of [TeX:] $$\sum_{S \in \mathcal{S}_{w}^{\alpha}} \hat{\tau}_{S}\left(\hat{t}_{h}\right),$$ which can be written as

[TeX:] $$\begin{aligned} \sum_{S \in \overline{\mathcal{S}}_{w}^{\alpha}} \hat{\tau}_{S}\left(\hat{t}_{h}\right)=\sum_{S \in \overline{\mathcal{S}}_{w}^{\alpha}} \sum_{t=1}^{\hat{t}_{h}} \mathbb{I}\left\{\mathbf{S}_{t}=S\right\} \\ =l_{h} M+\sum_{n=1}^{M-1} \sum_{t=t^{n}+1}^{t+1} \sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \mathbb{I}\left\{\mathbf{S}_{t}=S\right\}. \end{aligned}$$

Hence, we need to estimate [TeX:] $$\sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S\right\} \text { for } t \in \ \left(t^{n}, t^{n+1}\right], which can be obtained as in the following lemma.$$

Lemma 3. For each [TeX:] $$t \in\left(t^{n}, t^{n+1}\right],$$ we have

[TeX:] $$\begin{aligned} \sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S\right\} \\ \leq(1-\delta) \cdot \sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t-1}=S\right\} \\ \quad+\operatorname{Pr}\left\{\mathbf{S}_{t-1} \in \mathcal{S}\left(t^{n}\right)\right\}+\left(\left|\overline{\mathcal{S}}\left(t^{n}\right)\right|+\delta\right) \cdot 2|\mathcal{L}| t^{-2}. \end{aligned}$$

Proof. We first divide the case into three exclusive sub-cases based on the previous schedule [TeX:] $$\mathbf{S}_{t-1}: \text { events } \mathbb{A}=\left\{\mathbf{S}_{t-1} \in\right. \ \left.\mathcal{S}_{\boldsymbol{w}}^{\alpha}\right\}, \mathbb{B}=\left\{\mathbf{S}_{t-1} \in \underline{\mathcal{S}}\left(t^{n}\right)\right\} \text {, and } \mathbb{C}=\left\{\mathbf{S}_{t-1} \in \overline{\mathcal{S}}\left(t^{n}\right)\right\}.$$ Then we have

[TeX:] $$\begin{aligned} \sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S\right\} \\ =\sum_{S \in \mathcal{S}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S \mid \mathbb{A}\right\} \cdot \operatorname{Pr}\{\mathbb{A}\} \end{aligned}$$

[TeX:] $$+\sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S \mid \mathbb{B}\right\} \cdot \operatorname{Pr}\{\mathbb{B}\}$$

[TeX:] $$+\sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S \mid \mathbb{C}\right\} \cdot \operatorname{Pr}\{\mathbb{C}\}.$$

Let [TeX:] $$A_{t}$$ denote the set of augmentations chosen under our algorithm at time t. We can obtain a bound on (13) as

[TeX:] $$\begin{aligned} \sum_{S \in \mathcal{S}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S \mid \mathbb{A}\right\} \cdot \operatorname{Pr}\{\mathbb{A}\} \\ \leq \sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{r_{\bar{w}_{t}}(S) \geq r_{\bar{w}_{t}}\left(\mathbf{S}_{t-1}\right) \mid \mathbb{A}\right\} \cdot \operatorname{Pr}\{\mathbb{A}\} \\ \leq\left|\overline{\mathcal{S}}\left(t^{n}\right)\right| \cdot 2|\mathcal{L}| t^{-2}, \end{aligned}$$

where the last inequality comes from Lemma 2. The result holds for all [TeX:] $$t \in\left(t^{n}, t^{n+1}\right].$$ For the second term (14), we have

[TeX:] $$\sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S \mid \mathbb{B}\right\} \cdot \operatorname{Pr}\{\mathbb{B}\} \leq \operatorname{Pr}\{\mathbb{B}\}.$$

Finally, the third term (15) denotes the probability to transit from a sufficiently-played non-near-optimal matching to a sufficiently-played non-near-optimal matching, and thus we have

[TeX:] $$\begin{aligned} \sum_{S \in \mathcal{S}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S \mid \mathbb{C}\right\} \cdot \operatorname{Pr}\{\mathbb{C}\} \\ =\sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t} \in \overline{\mathcal{S}}\left(t^{n}\right) \mid \mathbf{S}_{t-1}=S\right\} \cdot \operatorname{Pr}\left\{\mathbf{S}_{t-1}=S\right\}. \end{aligned}$$

Letting [TeX:] $$S^{\prime}=S \oplus \mathcal{A}_{t}$$ and using Lemma 1, the conditional probability can be derived as

[TeX:] $$\begin{aligned} \operatorname{Pr}\left\{\mathbf{S}_{t} \in \overline{\mathcal{S}}\left(t^{n}\right) \mid \mathbf{S}_{t-1}=S\right\} \\ \leq \operatorname{Pr}\left\{\mathbf{S}_{t} \in \overline{\mathcal{S}}\left(t^{n}\right) \mid \mathbf{S}_{t-1}=S, S^{\prime} \in \mathcal{S}_{\boldsymbol{w}}^{\alpha}\right\} \cdot \delta+(1-\delta) \\ =\operatorname{Pr}\left\{r_{\bar{w}_{t}}(S) \geq r_{\bar{w}_{t}}\left(S^{\prime}\right)\right\} \cdot \delta+(1-\delta) \\ \leq 2|\mathcal{L}| t^{-2} \cdot \delta+(1-\delta). \end{aligned}$$

where the equality holds since [TeX:] $$\mathbf{S}_{t}$$ should be S (otherwise, [TeX:] $$\left.\mathbf{S}_{t}=\left(S \oplus \mathcal{A}_{t}\right) \notin \overline{\mathcal{S}}\left(t^{n}\right)\right)$$ and thus S should have the larger weight sum to be chosen by the augmentation algorithm, and the last inequality comes from Lemma 2. Hence, the third term (15) can be upper bounded by

[TeX:] $$\begin{aligned} \sum_{S \in \mathcal{S}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t-1}=S\right\} \cdot\left(2 \delta|\mathcal{L}| t^{-2}+1-\delta\right) \\ \leq 2 \delta|\mathcal{L}| t^{-2}+(1-\delta) \sum_{S \in \bar{S}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t-1}=S\right\}, \end{aligned}$$

for all [TeX:] $$t \in\left(t^{n}, t^{n+1}\right].$$

The result can be obtained by combining (16), (17), and (18).

In order to apply Lemma 3 to (11), we rewrite it in a recursive form. Let [TeX:] $$\eta=1-\delta, G_{n}=\left(\left|\overline{\mathcal{S}}\left(t^{n}\right)\right|+\delta\right) \cdot 2|\mathcal{L}|= (n+\delta) \cdot 2|\mathcal{L}| \text {, and } \Theta_{n}(t)=\sum_{S \in \overline{\mathcal{S}}\left(t^{n}\right)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S\right\}.$$ We have a recursive form of (12) as

[TeX:] $$\Theta_{n}(t) \leq \operatorname{Pr}\left\{\mathbf{S}_{t-1} \in \mathcal{S}\left(t^{n}\right)\right\}+G_{n} \cdot t^{-2}+\eta \Theta_{n}(t-1),$$

for [TeX:] $$t \in\left(t^{n}, t^{n+1}\right].$$ Extending the right side further down to [TeX:] $$t^{n},$$ we can obtain that

[TeX:] $$\Theta_{n}(t) \leq \eta^{t-t^{n}} \Theta_{n}\left(t^{n}\right)$$

[TeX:] $$+G_{n} \sum_{i=t^{n}+1}^{t} \eta^{t-i} \cdot i^{-2}$$

[TeX:] $$+\sum_{i=t^{n}+1}^{t} \eta^{t-i} \cdot \operatorname{Pr}\left\{\mathbf{S}_{i-1} \in \underline{\mathcal{S}}\left(t^{n}\right)\right\}.$$

By summing it over [TeX:] $$t \in\left(t^{n}, t^{n+1}\right]$$ on the both sides, we obtain the following lemma.

Lemma 4. The total number of times that sufficiently played non-near-optimal matchings are selected during [TeX:] $$\left(t^{n}, t^{n+1}\right]$$ is bounded by

[TeX:] $$\sum_{t=t^{n}+1}^{t^{n+1}} \Theta_{n}(t) \leq \frac{1}{\delta}\left(1+\frac{\pi^{2}}{6} G_{n}+\mathbb{E}\left[\sum_{S \in \mathcal{S}\left(t^{n}\right)} \tau_{S, n+1}\right]\right),$$

where [TeX:] $$\tau_{S, n+1}$$ denote the number of time slots that S is scheduled in [TeX:] $$\left(t^{n}, t^{n+1}\right].$$

The proof of Lemma 4 is omitted and can be found in [30].

Now, by taking expectation on (11), we can obtain the expected total number of times that non-near optimal matchings are selected up to time [TeX:] $$\hat{t}_{h}(\leq h)$$ as

[TeX:] $$$$

[TeX:] $$\begin{aligned} \sum_{S \in \overline{\mathcal{S}}_{w}} \mathbb{E}\left[\hat{\tau}_{S}\left(\hat{t}_{h}\right)\right]=l_{h} M+\sum_{n=1}^{M-1} \sum_{t=t^{n}+1}^{t^{n+1}} \Theta_{n}(t) \\ \leq l_{h} M+\frac{1}{\delta} \sum_{n=1}^{M-1}\left(1+G_{n} \frac{\pi^{2}}{6}+\mathbb{E}\left[\sum_{S \in \mathcal{S}\left(t^{n}\right)} \tau_{S, n+1}\right]\right), \\ \leq l_{h} M+\frac{M}{\delta}\left(1+\frac{|\mathcal{L}| \delta \pi^{2}}{3}+\frac{|\mathcal{L}|(M-1) \pi^{2}}{6}+l_{h}\right). \end{aligned}$$

The last inequality holds since (i) [TeX:] $$\sum_{n=1}^{M-1} G_{n}=\sum_{n=1}^{M-1}(n+ \text { \delta) } 2|\mathcal{L}| \leq M \cdot|\mathcal{L}| \cdot(2 \delta+(M-1)) \text {, and (ii) } \sum_{S \in \mathcal{S}\left(t^{n}\right)} \tau_{S, n+1} $$ is the total number that the matchings that have been chosen less than [TeX:] $$l_{h} \text { up to } t^{n}$$ are chosen during [TeX:] $$\left(t^{n}, t^{n+1}\right]$$ and thus results in [TeX:] $$\sum_{n=1}^{M^{\prime}-1} \sum_{S \in \mathcal{S}\left(t^{n}\right)} \tau_{S, n+1} \leq \sum_{k=2}^{M} l_{h} \leq l_{h} M.$$ From [TeX:] $$M \leq|\mathcal{S}|-1,$$ we have

[TeX:] $$\begin{aligned} \sum_{S \in \mathcal{S}_{w}^{\alpha}} \mathbb{E}\left[\hat{\tau}_{S}\left(\hat{t}_{h}\right)\right] \leq D_{1} \cdot \frac{\ln h}{\left(\Delta_{\min }^{\alpha}\right)^{2}} \\ +\frac{|\mathcal{S}|-1}{\delta}\left(1+\frac{|\mathcal{L}| \delta \pi^{2}}{3}+\frac{|\mathcal{L}|(|\mathcal{S}|-2) \pi^{2}}{6}\right) . \end{aligned}$$

This provides a bound on the number of times that nonnear- optimal matchings are selected up to [TeX:] $$\hat{t}_{h}.$$ For the rest time [TeX:] $$t \in\left(\hat{t}_{h}, h\right],$$ we need to compute [TeX:] $$\sum_{t=\hat{t}_{h}+1}^{h} \operatorname{Pr}\left\{\mathbf{S}_{t} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}\right\}.$$ Let [TeX:] $$S^{\prime}=\mathbf{S}_{t-1} \oplus \mathcal{A}_{t}.$$ Since next schedule [TeX:] $$\mathbf{S}_{t}$$ is either [TeX:] $$\mathbf{S}_{t-1} \text { and } S^{\prime}$$ under the algorithm, we divide the event [TeX:] $$\left\{\mathbf{S}_{t} \in \overline{\mathcal{S}}_{w}^{\alpha}\right\}$$ into three sub-cases based on [TeX:] $$\mathbf{S}_{t-1} \text { and } S^{\prime},$$ and compute the probability as

[TeX:] $$\begin{aligned} \operatorname{Pr}\left\{\mathbf{S}_{t} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}\right\}=\operatorname{Pr}\left\{S^{\prime} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}, \mathbf{S}_{t-1} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}\right\} \\ +\operatorname{Pr}\left\{S^{\prime} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}, \mathbf{S}_{t-1} \in \mathcal{S}_{w}^{\alpha}, r_{\boldsymbol{w}}\left(S^{\prime}\right) \geq r_{\boldsymbol{w}}\left(\mathbf{S}_{t-1}\right)\right\} \\ +\operatorname{Pr}\left\{S^{\prime} \in \mathcal{S}_{\boldsymbol{w}}^{\alpha}, \mathbf{S}_{t-1} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}, r_{\boldsymbol{w}}\left(S^{\prime}\right) \leq r_{\boldsymbol{w}}\left(\mathbf{S}_{t-1}\right)\right\}. \end{aligned}$$

This leads to

[TeX:] $$\begin{aligned} \operatorname{Pr}\left\{\mathbf{S}_{t} \in \mathcal{S}_{w}^{\alpha}\right\} \leq \operatorname{Pr}\left\{S^{\prime}\right.\left.\in \mathcal{S}_{w}^{\alpha} \mid \mathbf{S}_{t-1} \in \mathcal{S}_{w}^{\alpha}\right\} \cdot \operatorname{Pr}\left\{\mathbf{S}_{t-1} \in \mathcal{S}_{w}^{\alpha}\right\} \\ + \operatorname{Pr}\left\{r_{\boldsymbol{w}}\left(S^{\prime}\right) \geq r_{\boldsymbol{w}}\left(\mathbf{S}_{t-1}\right) \mid S^{\prime} \in \overline{\mathcal{S}}_{w}^{\alpha}, \mathbf{S}_{t-1} \in \mathcal{S}_{w}^{\alpha}\right\} \\ +\operatorname{Pr}\left\{r_{\boldsymbol{w}}\left(S^{\prime}\right) \leq r_{\boldsymbol{w}}\left(\mathbf{S}_{t-1}\right) \mid S^{\prime} \in \mathcal{S}_{\boldsymbol{w}}^{\alpha}, \mathbf{S}_{t-1} \in \mathcal{S}_{w}^{\alpha}\right\}. \end{aligned}$$

From Lemma 1, we have [TeX:] $$\operatorname{Pr}\left\{S^{\prime} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha} \mid \mathbf{S}_{t-1} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}\right\}= \ 1-\operatorname{Pr}\left\{S^{\prime} \in \mathcal{S}_{\boldsymbol{w}}^{\alpha} \mid \mathbf{S}_{t-1} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}\right\} \leq 1-\delta=\eta . \text { Since } \hat{\tau}_{S}(t) \geq l_{h}$$ for all S and [TeX:] $$t \in\left(\hat{t}_{h}, h\right],$$ Lemma 2 provides an upper bound [TeX:] $$2|\mathcal{L}| t^{-2}$$ on each conditional probability in the second and the third terms. As a result, we can obtain

[TeX:] $$\operatorname{Pr}\left\{\mathbf{S}_{t} \in \mathcal{S}_{\boldsymbol{w}}^{\alpha}\right\} \leq \eta \cdot \operatorname{Pr}\left\{\mathbf{S}_{t-1} \in \mathcal{S}_{w}^{\alpha}\right\}+4|\mathcal{L}| t^{-2}.$$

By extending the inequality in a recursive manner down to [TeX:] $$\hat{t}_{h},$$ we obtain that

[TeX:] $$\begin{aligned} \operatorname{Pr}\left\{\mathbf{S}_{t} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}\right\} \leq \eta^{t-\hat{t}_{h}} \cdot \operatorname{Pr}\left\{\mathbf{S}_{\hat{t}_{h}} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}\right\} \\ +4|\mathcal{L}| \sum_{i=\hat{t}_{h}+1}^{t} \eta^{t-i} i^{-2} \\ = \eta^{t-\hat{t}_{h}}+4|\mathcal{L}| \sum_{i=\hat{t}_{h}+1}^{t} \eta^{t-i} i^{-2}, \end{aligned}$$

where the last equality holds since [TeX:] $$\operatorname{Pr}\left\{\mathbf{S}_{\hat{t}_{h}} \in \overline{\mathcal{S}}_{w}^{\alpha}\right\}=1$$ from the definition of [TeX:] $$\hat{t}_{h}.$$ Summing over [TeX:] $$t \in\left(\hat{t}_{h}, h\right]$$ on the both sides, and from [TeX:] $$\eta=1-\delta,$$ we haves

[TeX:] $$\sum_{t=\hat{t}_{h}+1}^{h} \operatorname{Pr}\left\{\mathbf{S}_{t} \in \overline{\mathcal{S}}_{\boldsymbol{w}}^{\alpha}\right\} \leq \frac{1-\delta}{\delta}+\frac{2|\mathcal{L}| \pi^{2}}{3 \delta}.$$

Combining (23) and (24), we obtain

[TeX:] $$\begin{aligned} \sum_{S \in \mathcal{S}_{w}^{\alpha}} \mathbb{E}\left[\hat{\tau}_{S}(h)\right] \\ =\sum_{S \in \mathcal{S}_{w}^{\alpha}} \mathbb{E}\left[\hat{\tau}_{S}\left(\hat{t}_{h}\right)\right]+\sum_{t=\hat{t}_{h}+1}^{h} \operatorname{Pr}\left\{\mathbf{S}_{t} \in \mathcal{S}_{\boldsymbol{w}}^{\alpha}\right\} \\ \leq D_{1} \cdot \frac{\ln h}{\left(\Delta_{\min }^{\alpha}\right)^{2}}+D_{2}, \end{aligned}$$

where [TeX:] $$D_{1}=\left(1+\frac{1}{\delta}\right) \cdot 4|\mathcal{L}|^{2}(|\mathcal{L}|+1) \cdot(|\mathcal{S}|-1), \text { and } D_{2}= \ \frac{|\mathcal{S}|-1}{\delta}\left(1+\frac{|\mathcal{L}| \delta \pi^{2}}{3_{\wedge}}+\frac{|\mathcal{L}|(|\mathcal{S}|-2) \pi^{2}}{6}\right)+\frac{1-\delta}{\delta}+\frac{2|\mathcal{L}| \pi^{2}}{3 \delta} .$$

(2) When [TeX:] $$\left.\hat{t}_{h}>h \text { (i.e,. } \exists S \text { such that } \hat{\tau}_{S}(h)l_{h}\right) \text { : }$$ With the same definitions of [TeX:] $$l_{h}, \overline{\mathcal{S}}(t) \text {, and } \underline{\mathcal{S}}(t), \text { let }|\overline{\mathcal{S}}|= \ |\overline{\mathcal{S}}(h)| \text { and }|\underline{\mathcal{S}}|_{-}=|\overline{\mathcal{S}}(h)|.$$ At this time, we define [TeX:] $$\overline{\mathcal{S}}(h)= \ \left\{S^{1}, S^{2}, \cdots, S^{|\mathcal{S}|}\right\} \text { and let } t^{n}$$ denote the time at which matching [TeX:] $$S^{n}$$ is sufficiently scheduled, i.e. [TeX:] $$\hat{\tau}_{S^{n}}\left(t^{n}\right)=l_{h} .$$ Without loss of generality, we assume [TeX:] $$t^{1}t^{2}\cdotst|\overline{\mathcal{S}}|.$$ By time slot hP , [TeX:] $$\underline{\mathcal{S}}(h)$$ is non-empty [TeX:] $$\left(\text { since } \hat{t}_{h}>h\right),$$ and it is clear that [TeX:] $$\sum_{S \in \underline{\mathcal{S}}(h)} \hat{\tau}_{S}(h) \leq l_{h}|\underline{\mathcal{S}}|.$$

Similar to the case when [TeX:] $$\hat{t}_{h} \leq h,$$ we can obtain

[TeX:] $$\begin{aligned} \sum_{S \in \mathcal{S}_{w}^{\alpha}} \mathbb{E}\left[\hat{\tau}_{S}(h)\right] \\ =\sum_{S \in \mathcal{S}(h)} \mathbb{E}\left[\hat{\tau}_{S}(h)\right]+\sum_{t=1}^{h} \sum_{S \in \bar{S}(h)} \operatorname{Pr}\left\{\mathbf{S}_{t}=S\right\} \\ \leq l_{h}|\mathcal{S}|+l_{h}|\overline{\mathcal{S}}|+\sum_{n=1}^{|\mathcal{S}|} \sum_{t=t^{n}+1}^{t^{n+1}} \Theta_{n}(t) \\ \leq l_{h} M+\sum_{n=1}^{|\overline{\mathcal{S}}|} \frac{1}{\delta}\left(1+\frac{\pi^{2}}{6} G_{n}+\mathbb{E}\left[\sum_{S \in \mathcal{S}\left(t^{n}\right)} \tau_{x, n+1}\right]\right), \end{aligned}$$

where the last inequality comes from Lemma 4. As in (23), we caPn obtain

[TeX:] $$\begin{aligned} \sum_{S \in \mathcal{S}_{w}^{\alpha}} \mathbb{E}\left[\hat{\tau}_{S}(h)\right] \\ \leq l_{h} M+\frac{|\mathcal{S}|}{\delta}\left(1+\frac{|\mathcal{L}| \pi^{2} \delta}{3}+\frac{|\mathcal{L}|(|\mathcal{S}|+1) \pi^{2}}{6}+l_{h}\right) \\ \leq D_{1} \frac{\ln h}{\left(\Delta_{\min }^{\alpha}\right)^{2}}+D_{2}, \end{aligned}$$

where the last inequality holds due to [TeX:] $$|\bar{S}| \leq M-1.$$ Proposition 1 can be obtained by applying (25) and (26) to the decomposition inequality (9).

Remarks: Despite the logarithmic bound, the algorithm may suffer from slow convergence due to large values of constant [TeX:] $$D_{1} \text { and } D_{2}.$$ On the other hand, the bound is quite loose because we consider each matching separately. In practice, a link belongs to multiple matchings, and thus it can learn much faster.

B. Scheduling Efficiency

We now consider the throughput performance of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ across multiple frames. It can be obtained through the Lyapunov technique with time unit of frame length.

Proposition 2. For a sufficiently large frame length [TeX:] $$T, A^{k}-$$ UCB is rate-stable for any arrival rate strictly inside [TeX:] $$\frac{k-1}{k+1} \Lambda.$$ Proof. Given any [TeX:] $$\lambda$$ strictly inside [TeX:] $$\alpha \Lambda \text { with } \alpha=\frac{k-1}{k+1},$$ we consider the Lyapunov function [TeX:] $$L\left(t_{n}\right)=\frac{1}{2} \sum_{i \in \mathcal{L}}\left(q_{i}\left(t_{n}\right)\right)^{2}$$ at the start time [TeX:] $$t_{n}$$ of the n-th frame. If the Lyapunov function has a negative drift for sufficiently large queue lengths, then all the queues will remain finite.

From the queue evolution (1), we have

[TeX:] $$\begin{aligned} q_{i}\left(t_{n+1}\right) \leq \left(q_{i}\left(t_{n}\right)-\sum_{t=t_{n}}^{t_{n}+T-1} X_{i}(t) \cdot \mathbb{I}\left\{i \in \mathbf{S}_{t}\right\}\right)^{+} \\ +\sum_{t=t_{n}}^{t_{n}+T-1} a_{i}(t), \end{aligned}$$

where [TeX:] $$\left\{\mathbf{S}_{t}\right\}$$ denotes the sequence of matchings chosen by [TeX:] $$\mathrm{A}^{k}$$ - UCB. Let [TeX:] $$D\left(t_{n}\right)=L\left(t_{n+1}\right)-L\left(t_{n}\right).$$ The drift during a frame time can be written as

[TeX:] $$\begin{aligned} \mathbb{E} \left[D\left(t_{n}\right) \mid \boldsymbol{q}\left(t_{n}\right)\right] \leq \frac{1}{2} \sum_{i \in \mathcal{L}} \mathbb{E}\left[\left(\sum_{t=t_{n}}^{t_{n}+T-1} a_{i}(t)\right)^{2} \mid \boldsymbol{q}\left(t_{n}\right)\right] \\ +\frac{1}{2} \sum_{i \in \mathcal{L}} \mathbb{E}\left[\left(\sum_{t=t_{n}}^{t_{n}+T-1} X_{i}(t) \cdot \mathbb{I}\left\{i \in \mathbf{S}_{t}\right\}\right)^{2} \mid \boldsymbol{q}\left(t_{n}\right)\right] \\ +\sum_{t=t_{n}}^{t_{n}+T-1} \mathbb{E}\left[\sum_{i \in \mathcal{L}} q_{i}\left(t_{n}\right) a_{i}(t)\right.\\ \left.\left.-\sum_{i \in \mathbf{S}_{t}} q_{i}\left(t_{n}\right) X_{i}(t)\right) \mid \boldsymbol{q}\left(t_{n}\right)\right], \end{aligned}$$

where the first two terms can be bounded by CT for some constant C, because [TeX:] $$a_{i}(t), X_{i}(t), \text { and }|\mathcal{L}|$$ are bounded. Suppose that we have weight vector w at time [TeX:] $$t_{n} . \text { Let } S^{*}$$ denote an optimal matchings during the corresponding frame time, i.e., [TeX:] $$S^{*} \in \mathcal{S}_{w}^{*}=\arg \max _{S \in \mathcal{S}} \sum_{i \in S} w_{i}, \text { and let } r_{w}^{*}=\sum_{i \in S} \cdot w_{i}.$$ Since [TeX:] $$\lambda$$ strictly inside [TeX:] $$\alpha \Lambda,$$ there exists [TeX:] $$\epsilon>0$$ such that [TeX:] $$\lambda+\epsilon 1 \in \alpha \Lambda,$$ where 1 is the vector of all ones. Then from [TeX:] $$w_{i}=\frac{q_{i}\left(t_{n}\right)}{q^{*}\left(t_{n}\right)} \mu_{i},$$ we can obtain

[TeX:] $$\begin{aligned} \mathbb{E}[D\left.\left(t_{n}\right) \mid \boldsymbol{q}\left(t_{n}\right)\right] \\ \leq C T+\sum_{t=t_{n}}^{t_{n}+T-1}\left(\mathbb{E}\left[\sum_{i \in \mathcal{L}} q_{i}\left(t_{n}\right) a_{i}(t) \mid \boldsymbol{q}\left(t_{n}\right)\right]\right.\\ \left.-\mathbb{E}\left[\sum_{i \in \mathbf{S}_{t}} q_{i}\left(t_{n}\right) X_{i}(t) \mid \boldsymbol{q}\left(t_{n}\right)\right]\right) \\ = C T+q^{*}\left(t_{n}\right) \sum_{t=t_{n}}^{t_{n}+T-1}\left(\sum_{i \in \mathcal{L}} \frac{q_{i}\left(t_{n}\right)}{q^{*}\left(t_{n}\right)} \lambda_{i}-\alpha r_{\boldsymbol{w}}^{*}\right) \\ +q^{*}\left(t_{n}\right) \sum_{t=t_{n}}^{t_{n}+T-1}\left(\alpha r_{\boldsymbol{w}}^{*}-\mathbb{E}\left[r_{\boldsymbol{w}}\left(\mathbf{S}_{t}\right) \mid \boldsymbol{q}\left(t_{n}\right)\right]\right) \\ \leq C T-\epsilon T \sum_{i \in \mathcal{L}} q_{i}\left(t_{n}\right)+q^{*}\left(t_{n}\right) \cdot \operatorname{Reg}^{\alpha}(T), \end{aligned}$$

where the equality holds due to the independence of link rates, aPnd the last inequality holds since [TeX:] $$\lambda+\epsilon 1 \in \alpha \Lambda$$ and thus [TeX:] $$\sum_{i \in \mathcal{L}} \frac{q_{i}\left(t_{n}\right)}{q^{*}\left(t_{n}\right)}\left(\lambda_{i}+\epsilon\right)\alpha r_{\boldsymbol{w}}^{*}.$$ Dividing both sides by T, we have

[TeX:] $$\frac{1}{T} \mathbb{E}\left[D\left(t_{n}\right) \mid \boldsymbol{q}\left(t_{n}\right)\right] \leq C-\epsilon \sum_{i \in \mathcal{L}} q_{i}\left(t_{n}\right)+q^{*}\left(t_{n}\right) \cdot \frac{\operatorname{Reg}^{\alpha}(T)}{T}.$$

Since Proposition 1 implies that [TeX:] $$\frac{\operatorname{Reg}^{\alpha}(T)}{T}\epsilon$$ for sufficiently large T, we have a negative drift for sufficiently large queue lengths.

Proposition 2 means that [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ can stabilize the queue lengths under packet arrival dynamics for any [TeX:] $$\lambda \in \frac{k-1}{k+1} \Lambda.$$

V. DISTRIBUTED IMPLEMENTATION [TeX:] $$\left(d \mathrm{~A}^{k}-\mathrm{UCB}\right)$$

Although the augmentation algorithm has O(k) complexity and amenable to implement in a distributed fashion, the link index [TeX:] $$\bar{w}_{t} \text { of } \mathrm{A}^{k}-\mathrm{UCB}$$ includes global information [TeX:] $$q^{*}\left(t_{n}\right)-$$ the largest queue length in the network at the start of each frame n. This normalization is due to Hoeffding inequality and essential for the provable regret performance bound.

We pay attention to the fact that [TeX:] $$A^{k}-U C B$$ indeed learns the expected value of the queue weighted link rate, i.e., [TeX:] $$q_{i}\left(t_{n}\right) \mu_{i},$$ and the global information [TeX:] $$q^{*}\left(t_{n}\right)$$ takes the role of normalizing the weight in the range of [0, 1]. This implies that it may be able to separate the normalizing parameter from the learning. To this end, we develop a distributed version of [TeX:] $$\mathrm{A}^{k}-$$ UCB, denoted by [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB},$$ and describe it with two key differences. For the ease of exposition, we assume that [TeX:] $$\mathcal{A}_{t}$$ consists of a single augmentation.

1) Local normalizer: Each node v maintains a local normalizer [TeX:] $$\tilde{q}_{v},$$ which is initialized to [TeX:] $$\max _{u \in \mathcal{N}(v)} q_{(u, v)}\left(t_{n}\right)$$ at the beginning of each frame n. At each time slot t, node v in an augmentation updates its local normalizer twice as follows. 1) In each initialization stage or path augmenting stage, the REQ message from u to v includes additional information of [TeX:] $$\tilde{q}_{u}.$$ The receiving node v sets [TeX:] $$\tilde{q}_{v} \leftarrow \max \left\{\tilde{q}_{u}, \tilde{q}_{v}\right\}.$$ This repeats while building the augmentation. After the cycle-checking stage, the terminus w has [TeX:] $$\tilde{q}_{w}=\tilde{q}^{*}$$ that is the largest local normalizer in the augmentation. 2) In the back-propagating stage, this value [TeX:] $$\tilde{q}^{*}$$ is back-propagated together and each node v in the augmentation sets [TeX:] $$\tilde{q}_{v} \leftarrow \tilde{q}^{*}.$$ Hence, at the end of the time slot, all the nodes in the augmentation have the same local normalizer [TeX:] $$\tilde{q}^{*}.$$

2) Separate gain computation: In the meantime, we change the way to compute the gain. To elaborate, let [TeX:] $$G_{u}^{\prime}$$ denote the new gain normalized by [TeX:] $$\tilde{q}_{u}.$$ At each mini-slot in the path augmenting stage, whenever node u transmits an REQ message to node v, we divide [TeX:] $$G_{u}^{\prime} \text { into } G_{u, 1}^{\prime}+G_{u, 2}^{\prime},$$ where [TeX:] $$G_{u, 1}^{\prime}$$ is for average reward (normalized by factor [TeX:] $$\tilde{q}_{u} \text { ) and } G_{u, 2}^{\prime}$$ for confidence interval. They are included in the REQ message, separately. Then, after the receiving node v updates the local normalizer [TeX:] $$\tilde{q}_{v},$$ it renormalizes the received reward gain as [TeX:] $$G_{u, 1}^{\prime} \cdot \tilde{q}_{u} / \tilde{q}_{v}.$$ Once the next link is decided as i = (v, n), it computes [TeX:] $$G_{v, 1}^{\prime}$$ accordingly by either adding or subtracting its average reward normalized by [TeX:] $$\tilde{q}_{v}, \text { i.e., } \hat{w}_{i}^{\prime}(t)=\frac{q_{i}\left(t_{n}\right)}{\tilde{q}_{v}}.$$ [TeX:] $$\frac{1}{\hat{\tau}_{i}(t)} \sum_{j=t_{n}+1}^{t} X_{i}(j) \cdot \mathbb{I}\left\{i \in S_{j}\right\} . G_{u, 2}^{\prime}$$ can be obtained simply by adding the confidence interval. Let [TeX:] $$A^{\prime}$$ is the augmentation up to node v, and let [TeX:] $$A_{1}^{\prime}=A^{\prime}-\mathbf{S}_{t-1}$$ and let [TeX:] $$A_{2}^{\prime}=A^{\prime} \cap \mathbf{S}_{t-1}.$$ Then the gains

[TeX:] $$\begin{aligned} G_{v, 1}^{\prime} =\sum_{i \in A_{1}^{\prime}} \hat{w}_{i}^{\prime}-\sum_{j \in A_{2}^{\prime}} \hat{w}_{j}^{\prime} \\ =\frac{\tilde{q}_{u}}{\tilde{q}_{v}} G_{u, 1}^{\prime}+\left(\mathbb{I}\left\{v \in A_{1}^{\prime}\right\}-\mathbb{I}\left\{v \in A_{2}^{\prime}\right\}\right) \cdot \hat{w}_{v}^{\prime}, \\ G_{v, 2}^{\prime} =\sum_{i \in A_{1}^{\prime}} \sqrt{\frac{(|\mathcal{L}|+1) \ln t}{\hat{\tau}_{i}}}-\sum_{j \in A_{2}^{\prime}} \sqrt{\frac{(|\mathcal{L}|+1) \ln t}{\hat{\tau}_{j}}} \\ =G_{u, 2}^{\prime}+\left(\mathbb{I}\left\{v \in A_{1}^{\prime}\right\}-\mathbb{I}\left\{v \in A_{2}^{\prime}\right\}\right) \cdot \sqrt{\frac{(|\mathcal{L}|+1) \ln t}{\hat{\tau}_{v}}}, \end{aligned}$$

can be computed given the value of [TeX:] $$\tilde{q}_{u}$$ and the gains [TeX:] $$G_{u, 1}^{\prime}, G_{u, 2}^{\prime}.$$By repeating this during the augmenting

Fig. 3.

Regret traces for learning performance. For comparison across frames, the regret is reset to 0 at each frame boundary [TeX:] $$\left(T=10^{6}\right. \text { time slots), }$$ and normalized by the maximum expected reward [TeX:] $$r_{\boldsymbol{w}}^{*}.$$

Fig. 4.

Queue lengths for scheduling efficiency. Given a scheduler, if the arrival rate gets closer to the boundary of its stability region, the queue length soars quickly.

stage, we can obtain the gain normalized by [TeX:] $$\tilde{q}^{*}$$ at the terminus.

Remarks: During a frame time, the local normalizer of a node is non-decreasing over time slots. In addition, at the same time slot, two nodes in the network may have a different normalizer value. Hence, our previous analysis results for [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ cannot be directly applied to [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB}.$$ However, we highlight that, given a time slot, all the nodes in the same augmentation have the same value of the (local) normalizer, which is of importance, since the gain comparison for making a decision occurs only within an augmentation. On the other hand, as the time slot t increases, the value of the global normalizer [TeX:] $$q^{*}\left(t_{n}\right)$$ is disseminated throughout the network and all the local normalizers will converge to this value. Considering that, it is not difficult to show that there exists some T' such that all nodes v have [TeX:] $$\tilde{q}_{v}=q^{*}\left(t_{n}\right)$$ with probability close to 1 for all [TeX:] $$t>T^{\prime},$$ we believe that [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB}$$ also achieves O(log T) regret performance and [TeX:] $$\frac{k-1}{k+1} \Lambda$$ capacity, if the frame length T is sufficiently large. Rigorous proof remains as future work.


We evaluate the performance of our proposed schemes through simulations. We first consider a 4x4 grid network topology with the primary interference model, and then conduct extend simulations with a randomly generated network. Time is slotted. At each time slot, a packet arrives at link i with probability [TeX:] $$\lambda_{i},$$ and for a scheduled link j, a packet successfully departs the link with probability [TeX:] $$\mu_{j}$$ Both [TeX:] $$\lambda \text { and } \mu$$ are unknown to the controller. The arrivals and the departures are independent across the links and time slots.

Regret performance: We first investigate the regret performance of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB} \text { and } d \mathrm{~A}^{k}-\mathrm{UCB}.$$ We set the seed probability p = 0.2 for the schemes. We consider a large frame length of [TeX:] $$T=10^{6}$$ time slots to observe their regret growth. An identical arrival rate [TeX:] $$\lambda_{i}=0.08$$ is set for all links i, and the departure rate [TeX:] $$\mu_{i}$$ is set uniformly at random in range [TeX:] $$[0.25,0.75]^{5}.$$ We simulate the two schemes with different k’s and measure their regret performance. For the comparison across frames, the regret value is set to 0 at each frame start, and normalized with respect to the maximum expected reward sum [TeX:] $$r_{\boldsymbol{w}}^{*}$$ within the frame.

Fig. 3(a) illustrates the regret traces of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB} \text {, }$$ which is an average of 10 simulation runs. We can observe the logarithmic regret growth in both cases. Recall that our regret analysis in Proposition 1 is for [TeX:] $$\alpha$$-regret with [TeX:] $$\alpha=\frac{k-1}{k+1}.$$ Thus empirical performance of the proposed schemes are much better than the analytical bound. The performance in terms of [TeX:] $$\alpha \text {-regret }$$ is shown in Fig. 3(b), where the gap from 0 can be interpreted as the level of practical difficulty in achieving analytic performance bound: as k increases, it harder to achieve [TeX:] $$\frac{k-1}{k+1} \text {-regret. }$$ Fig. 3(c) shows the regret values of [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB}.$$ Comparing with those of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB},$$ they achieve similar learning performance in terms of regret, and thus we can conclude that the performance loss due to local normalizer

Fig. 5.

Stability in different network topologies. [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB}$$ achieves high performance in a larger randomly-generated network, and GMM with UCB may suffer from low performance in a specific ring network.

in [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB}$$ is negligible in practice.

Scheduling efficiency: We evaluate scheduling efficiency of the proposed schemes in the grid network. We use the same simulation settings, but change frame length T and arrival rate [TeX:] $$\lambda\left(\lambda_{i}=\lambda \text { for all } i\right. \text { ). }$$ By increasing [TeX:] $$\lambda,$$ the arrival rate gets closer to the boundary of its stability region, and when this occurs, the queue length will soar quickly. We conduct each simulation for [TeX:] $$10^{6}$$ time slots and measure queue lengths when the simulations end. For each [TeX:] $$\lambda,$$ we average the queue length of 10 simulation runs. By observing the arrival rate where the queue length starts increasing quickly, we can indirectly compare the achievable stability region [TeX:] $$\gamma \Lambda$$ for different scheduling policies [7]. A policy with larger [TeX:] $$\gamma$$ is better.

Fig. 4(a) demonstrates how the bound changes according to the frame length. From the results, we can observe that the critical point of [TeX:] $$\lambda,$$ around which the queue length starts soaring, is increasing for [TeX:] $$T \leq 10^{6}$$ and then decreasing for [TeX:] $$T \geq \ 2 \cdot 10^{6}.$$ This is somewhat expected, since too small frame length will lead to incomplete learning, and a larger frame length results in a relatively slower response to the queue dynamics.

We now evaluate the performance of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ in comparison with two schemes of MWM and UCB-based GMM. MWM is a well-known optimal scheduler [8], and it is a centralized algorithm that not only requires the knowledge about the weight [TeX:] $$q_{i}(t) \mu_{i}$$ at each time slot t, but also has a highorder computational complexity. In our simulations, we use its performance as a reference value. The UCB-based GMM [19]) finds a matching by including the link with the highest UCB index first. It is known to achieves [TeX:] $$\frac{1}{2} \Lambda$$ and has the linear computational complexity.

Fig. 4(b) demonstrate the queue lengths of MWM, UCBbased GMM, and [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}.$$ MWM operates at each time slot, and for UCB-based GMM and [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB},$$ we use T = 5000. Each simulation runs for 106 time slots (i.e., 200 frames) and we measure the queue lengths after the simulations. For MWM, the queue length quickly increases at around [TeX:] $$\lambda=0.09,$$ which can be considered as the boundary of the capacity region [TeX:] $$\Lambda.$$ Under [TeX:] $$\mathrm{A}^{k} \text {-UCB with } k=2,3,4 \text {, }$$ the queue lengths increase quickly around [TeX:] $$\lambda=0.084$$ for all k, which exceeds their theoretic bound [TeX:] $$\frac{k-1}{k+1} \cdot 0.09=0.03,0.045,0.054,$$ respectively. Interestingly, the impact of k on throughput is not significant, which seems to be due to the small network size – we can observe that a larger k leads to lower queue lengths in all arrival rates and thus achieves better delay performance. The UCB-based GMM achieves the performance closet to that of MWM, which is also far beyond its theoretic bound [TeX:] $$\frac{1}{2} .$$ Fig. 4(c) shows that [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB}$$ achieves almost the same performance as [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}.$$

Performance in randomly generated networks: We now evaluate the performance of the proposed schemes in a larger, irregular-shaped network. To this end, we randomly generate a network of 50 nodes and 200 links as shown in Fig. 5(a), and run simulations for 1000 frame times with T = 500 (i.e., total [TeX:] $$5 \cdot 10^{5}$$ time slots). For each link i, we set the successful transmission rate [TeX:] $$\mu_{i}$$ uniformly at random in range [0.25, 0.75], and set the arrival rate as [TeX:] $$\lambda_{i}=\lambda \cdot \rho_{i},$$ where [TeX:] $$\rho_{i}$$ is chosen uniformly at random in range [0.4, 0.7]. Due to high computational complexity of MWM, we simulate only UCBbased GMM, [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}, \text { and } d \mathrm{~A}^{k}-\mathrm{UCB}$$ in this experiment, and use the performance of UCB-based GMM as a reference value. It has been observed that GMM algorithm often achieves the optimal scheduling performance in this randomized network environment [7].

Fig. 5(b) demonstrates the queue lengths of UCB-based GMM and [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB} \text { with } k=3,6,9 \text {. }$$ All 4 schemes achieve almost-identical performance, and the setting of k is not sensitive to the performance. The results of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB}$$ are almost identical to that of [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB},$$ and thus omitted.

Low performance of UCB-based GMM: So far, UCB-based GMM achieves close-to-optimal performance despite its low performance guarantee of [TeX:] $$\frac{1}{2} \Lambda.$$ It is an interesting question whether the performance bound is not tight due to technical difficulties and its true performance is close to the optimal. Unfortunately, however, we shows in the next experiment that this is not the case, and UCB-based GMM may suffer from low performance in a certain scenario. We consider a 6-link ring topology, where the links are numbered from 1 to 6 in a clockwise direction. The service rate of each link follows a Bernoulli distribution with mean [TeX:] $$\frac{1}{2}$$ and the packet arrival on each link is also a Bernoulli process with mean6 [TeX:] $$\frac{1}{6}+\epsilon \text { where } \epsilon=0.08.$$ We set the frame length T = 6000. Other environment settings are the same as before, except that the initial queue length is [TeX:] $$\left\{\frac{3 T}{6}, \frac{2 T}{6}, \frac{T}{6}, \frac{3 T}{6}, \frac{2 T}{6}, \frac{T}{6}\right\}.$$ Fig. 5(c) shows the queue length traces for UCB-based GMM and our proposed schemes. We can observe that while the queue lengths of [TeX:] $$\mathrm{A}^{k}-\mathrm{UCB} \text { and } d \mathrm{~A}^{k}-\mathrm{UCB}$$ are stabilized, those of the UCB-based GMM keep increasing. This is because, the greedy algorithm tends select a matching with the two links of the largest queue at the beginning of each frame. In contrast, [TeX:] $$\mathrm{A}^{k}-$$ UCB and [TeX:] $$d \mathrm{~A}^{k}-\mathrm{UCB}$$ select a matching with three links by considering their weight sum. This result implies that in a certain circumstance, UCB-based GMM may suffer from low scheduling efficiency.

6 This implies that [TeX:] $$\lambda \in\left(\frac{2}{3}+4 \epsilon\right) \Lambda$$ since an optimal scheduler can support arrival rate of up to [TeX:] $$\frac{1}{4}$$ on each link.


In this work, we addressed the joint problem of learning and scheduling in multi-hop wireless networks. Without a priori knowledge on link rates, we aim to find a sequence of schedules such that all the queue lengths remain finite under packet arrival dynamics. By incorporating the augmentation algorithm into a learning procedure, we develop provably efficient lowcomplexity schemes that i) achieve logarithmic regret growth in learning, and ii) have the throughput performance that can be arbitrarily close to the optimal. We extend the result to a distributed scheme that is amenable to implement in largescale networks. We also verify our results through simulations.


Daehyun Park

Daehyun Park received his M.S. degree from the school of ECE at Ulsan National Institute of Science and Technology (UNIST) in 2020. His research interests include multi-armed bandits.


Sunjung Kang

Sunjung Kang received her M.S. degree from the school of ECE at Ulsan National Institute of Science and Technology (UNIST) in 2018. She is currently a Ph.D. student in the department of ECE at The Ohio State University. Her research interests include the age of information, remote estimation and multiarmed bandits.


Changhee Joo

Changhee Joo received the Ph.D. degree from Seoul National University in 2005. He was with Purdue University and The Ohio State University, and then worked at Korea University of Technology and Ed- ucation (KoreaTech), and Ulsan National Institute of Science and Technology (UNIST). Since 2019, he has been with Korea University. His research interests are in the broad areas networking, learning, modeling, and optimization. He was a recipient of the IEEE INFOCOM 2008 Best Paper Award, the KICS Haedong Young Scholar Award (2014), the ICTC 2015 Best Paper Award, and the GAMENETS 2018 Best Paper Award. He was an Associate Editor of the IEEE/ACM Transactions on Networking, and currently an Editor of the IEEE Transactions Vehicular Technology, a Division Editor of the Journal of Communications and Networks, and has served several primary conferences as a technical program committee member, including IEEE INFOCOM, ACM MOBIHOC, IEEE WiOpt, and IEEE GLOBECOM.


  • 1 C. Joo, G. Sharma, N. B. Shroff, R. R. Mazumdar, "On the complexity of scheduling in wireless networks," in EURASIP J. Wireless Commun. Netw., Oct, 2010;custom:[[[-]]]
  • 2 X. Lin, N. B. Shroff, "The impact of imperfect scheduling on crosslayer congestion control in wireless networks," IEEE /ACM Trans. Netw., vol. 14, no. 2, pp. 302-315, Apr, 2006.custom:[[[-]]]
  • 3 C. Joo, X. Lin, N. B. Shroff, "Greedy maximal matching: Performance limits for arbitrary network graphs under the node-exclusive interference model," IEEE Trans. Autom. Control, vol. 54, no. 12, pp. 2734-2744, Dec, 2009.doi:[[[10.1109/TAC.2009.2031719]]]
  • 4 C. Joo, N. B. Shroff, "Local greedy approximation for scheduling in multi-hop wireless networks," IEEE Trans. Mobile Comput., vol. 11, no. 3, pp. 414-426, Mar, 2012.custom:[[[-]]]
  • 5 X. Wu, R. Srikant, J. R. Perkins, "Scheduling efficiency of distributed greedy scheduling algorithms in wireless networks," in IEEE Trans. Mobile Comput., 2007;vol. 6, no. 6, pp. 595-605. custom:[[[-]]]
  • 6 X. Lin, S. B. Rasool, "Distributed and provably efficient algorithms for joint channel-assignment, scheduling, and routing in multichannel Ad hoc wireless networks," IEEE /ACM Trans. Netw., vol. 17, no. 6, pp. 1874-1887, 2009.custom:[[[-]]]
  • 7 C. Joo, N. B. Shroff, "Performance of random access scheduling schemes in multi-hop wireless networks," in IEEE /ACM Trans. Netw., Oct, 2009;vol. 17, no. 5. custom:[[[-]]]
  • 8 L. Tassiulas, A. Ephremides, "Stability properties of constrained queueing systems and scheduling policies for maximal throughput in multihop radio networks," IEEE Trans. Autom. Control, vol. 37, no. 12, pp. 1936-1948, Dec, 1992.custom:[[[-]]]
  • 9 J. Choi, "On improving throughput of multichannel ALOHA using preamble-based exploration," J. Commun. Netw., vol. 22, no. 5, pp. 380389-380389, 2020.custom:[[[-]]]
  • 10 B. Hajek, G. Sasaki, "Link scheduling in polynominal time," IEEE Trans. Inf. Theory, vol. 34, no. 5, Sept, 1988.custom:[[[-]]]
  • 11 L. Bui, S. Sanghavi, R. Srikant, "Distributed link scheduling with constant overhead," IEEE /ACM Trans. Netw., vol. 17, no. 5, pp. 14671480-14671480, Oct, 2009.custom:[[[-]]]
  • 12 L. Jiang, J. Walrand, "A distributed CSMA algorithm for throughput and utility maximization in wireless networks," in IEEE /ACM Trans. Netw., June, 2010;vol. 18, no. 13, pp. 960-972. custom:[[[-]]]
  • 13 J. Ni, B. Tan, R. Srikant, "Q-CSMA: Queue-length based CSMA/CA algorithms for achieving maximum throughput and low delay in wireless networks," in IEEE /ACM Trans. Netw., June, 2012;vol. 20, no. 3. custom:[[[-]]]
  • 14 C. Joo, "On random access scheduling for multimedia traffic in multihop wireless networks," IEEE Trans. Mobile Comput., vol. 12, no. 4, pp. 647-656, Apr, 2013.custom:[[[-]]]
  • 15 S. A. Borbash, A. Ephremides, "Wireless link scheduling with power control and SINR constraints," IEEE Trans. Inf. Theory, vol. 52, no. 11, pp. 5106-5111, Nov, 2006.doi:[[[10.1109/TIT.2006.883617]]]
  • 16 J.-G. Choi, C. Joo, J. Zhang, N. B. Shroff, "Distributed link scheduling under SINR model in multihop wireless networks," IEEE /ACM Trans. Netw., vol. 22, no. 4, pp. 1204-1217, Aug, 2014.doi:[[[10.1109/TNET.2013.2273100]]]
  • 17 F. Li, D. Yu, H. Yang, J. Yu, H. Karl, X. Cheng, "Multi-amedbandit-based spectrum scheduling algorithms in wireless networks: A survey," IEEE Wireless Commun., vol. 27, no. 1, pp. 24-30, 2020.custom:[[[-]]]
  • 18 Q. Zhao, L. Tong, A. Swami, Y. Chen, "Decentralized cognitive MAC for opportunistic spectrum access in Ad hoc networks: A POMDP framework," IEEE J. Sel. Areas Commun., vol. 25, no. 3, pp. 589-600, Apr, 2007.doi:[[[10.1109/JSAC.2007.070409]]]
  • 19 T. Stahlbuhk, B. Shrader, E. Modiano, "Learning algorithms for scheduling in wireless networks with unknown channel statistics," Ad Hoc Netw., vol. 85, pp. 131-144, 2019.custom:[[[-]]]
  • 20 W. Chen, Y. Wang, Y. Yuan, "Combinatorial multi-armed bandit: General framework and applications," in Proc. ICML, 2013.custom:[[[-]]]
  • 21 T. Lai, H. Robbins, "Asymptotically efficient adaptive allocation rules," Adv. Appl. Math., vol. 6, no. 1, pp. 4-22, Mar, 1985.custom:[[[-]]]
  • 22 P. Auer, N. Cesa-Bianchi, P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2, pp. 235256-235256, May, 2002.doi:[[[10.1023/A:1013689704352]]]
  • 23 V. Anantharam, P. Varaiya, J. Walrand, "Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple playspart I: I.I.D. rewards," IEEE Trans. Autom. Control, vol. 32, no. 11, pp. 968-976, Nov, 1987.custom:[[[-]]]
  • 24 K. Liu, Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," IEEE Trans. Signal Processing, vol. 58, no. 11, pp. 5667-5681, Nov, 2010.doi:[[[10.1109/TSP.2010.2062509]]]
  • 25 Y. Gai, B. Krishnamachari, R. Jain, "Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations," IEEE /ACM Trans. Netw., vol. 20, no. 5, pp. 1466-1478, Oct, 2012.doi:[[[10.1109/TNET.2011.2181864]]]
  • 26 Y. Gai, B. Krishnamachari, "Decentralized online learning algorithms for opportunistic spectrum access," in Proc. IEEE GLOBECOM, Dec, 2011;custom:[[[-]]]
  • 27 A. Anandkumar, N. Michael, A. K. Tang, A. Swami, "Distributed algorithms for learning and cognitive medium access with logarithmic regret," IEEE J. Sel. Areas Commun., vol. 29, no. 4, pp. 731-745, Apr, 2011.doi:[[[10.1109/JSAC.2011.110406]]]
  • 28 H. Tibrewal, S. Patchala, M. K. Hanawal, S. J. Darak, "Distributed learning and optimal assignment in multiplayer heterogeneous networks," in Proc. IEEE INFOCOM, 2019;custom:[[[-]]]
  • 29 S. Kang, C. Joo, "Low-complexity learning for dynamic spectrum access in multi-user multi-channel networks," in IEEE Trans. Mobile Comput., 2021;custom:[[[-]]]
  • 30 D. Park, S. Kang, and C. Joo, Distributed link scheduling with unknown link rates in multi-hop wireless networks, https://www.dropbox.com/s/uh07dc2xbj55dgm/techreport.pdf?dl=0,2021