We consider multi-armed bandit problems in social groups wherein each individual has bounded memory and shares the common goal of learning the best arm/option. We say an individual learns the best option if eventually (as $t \to \infty$) it pulls only the arm with the highest expected reward. While this goal is provably impossible for an isolated individual due to bounded memory, we show that, in social groups, it can be achieved easily with the aid of social persuasion (i.e., communication) as long as the communication networks/graphs satisfy some mild conditions. In this work, we model and analyze a type of learning dynamics that is well-observed in social groups. Specifically, under the learning dynamics of interest, an individual sequentially decides which arm to pull next based not only on its private reward feedback but also on the suggestion provided by a randomly chosen neighbor. To deal with the interplay between the randomness in the rewards and in the social interaction, we employ the mean-field approximation method. Since the individuals in the network may not be exchangeable when the communication graph is not a clique, we go beyond the classic mean-field techniques and apply a refined version of mean-field approximation:

- Using coupling, we show that if the communication graph is connected and is either regular or has a doubly stochastic degree-weighted adjacency matrix, then with probability $\to 1$ as the social group size $N \to \infty$, every individual in the social group learns the best option.
- If the minimum degree of the graph diverges as $N \to \infty$, then over an arbitrary but given finite time horizon, the sample paths describing the opinion evolutions of the individuals are asymptotically independent. In addition, the proportions of the population with different opinions converge to the unique solution of a system of ODEs. Interestingly, the obtained system of ODEs is invariant to the structure of the communication graph. In its solution, the proportion of the population holding the correct opinion converges to $1$ exponentially fast in time.

Notably, our results hold even if the communication graphs are highly sparse.
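The learning dynamics are only summarized above, so the following toy simulation is a sketch under an assumed update rule (keep the current arm after a success, otherwise adopt the arm of a uniformly random group member on a clique); it is meant only to illustrate how social persuasion drives the whole population toward the best arm:

```python
import random

def simulate(n_agents=500, p=(0.3, 0.7), rounds=200, seed=0):
    """Toy dynamics on a clique (hypothetical rule, not the paper's exact
    one): each agent pulls its current arm; after a failure it adopts the
    arm of a uniformly random group member (self-loops allowed for
    simplicity), after a success it keeps its arm.
    Returns the final fraction of agents on the best arm (arm 1)."""
    rng = random.Random(seed)
    arms = [rng.randrange(2) for _ in range(n_agents)]
    for _ in range(rounds):
        # Suggestions sampled from the current opinion profile.
        suggestions = [arms[rng.randrange(n_agents)] for _ in range(n_agents)]
        arms = [a if rng.random() < p[a] else s
                for a, s in zip(arms, suggestions)]
    return sum(arms) / n_agents
```

Under this rule the mean-field map for the fraction $x$ on the best arm is roughly $x' = x(p_1 + (1-p_1)x) + (1-x)(1-p_0)x$, whose only stable fixed point on $(0,1]$ is $x = 1$, matching the exponential convergence described above.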

Epidemic models accurately represent (among other processes) the spread of diseases, the spread of information (rumors, viral videos, news stories, etc.), the spread of malevolent agents in a network (computer viruses, malicious apps, etc.), and even biological processes (pathways in cell signaling networks, chains of activation in the gene regulatory network, etc.). We focus on epidemics that spread on an underlying graph [5].

In this paper, we consider a load balancing system under a general pull-based policy. In particular, each arrival is randomly dispatched to a server whose queue length is below a threshold; if no such server exists, the arrival is randomly assigned to an arbitrary server. We are interested in the fundamental relationship between the threshold and the delay performance of the system in heavy traffic. To this end, we first establish the following necessary condition for heavy-traffic delay optimality: the threshold must grow to infinity as the exogenous arrival rate approaches the boundary of the capacity region (i.e., as the load intensity approaches one), but its growth rate must be slower than a polynomial function of the mean number of tasks in the system. As a special case of this result, we directly show that the delay performance of the popular pull-based policy Join-Idle-Queue (JIQ) is not heavy-traffic delay optimal, though it performs strictly better than random routing. We further show that a sufficient condition for heavy-traffic delay optimality is that the threshold grows logarithmically with the mean number of tasks in the system. This result directly resolves a generalized version of the conjecture by Kelly and Laws.
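The dispatching rule described above is concrete enough to sketch; the simulation below is a toy discrete-time sanity check whose parameters (Bernoulli arrivals and services, queue counts) are illustrative and not from the paper:

```python
import random

def dispatch(queues, threshold, rng):
    """The pull-based rule described above: route to a uniformly random
    server with queue length below the threshold; if there is none,
    route to a uniformly random server."""
    below = [i for i, q in enumerate(queues) if q < threshold]
    return below[rng.randrange(len(below))] if below else rng.randrange(len(queues))

def simulate(n=10, arrival_p=0.9, service_p=0.1, threshold=3,
             steps=50000, seed=1):
    """Toy discrete-time system: one Bernoulli arrival stream, Bernoulli
    service at each server. Returns the time-average total queue length."""
    rng = random.Random(seed)
    queues = [0] * n
    area = 0
    for _ in range(steps):
        if rng.random() < arrival_p:
            queues[dispatch(queues, threshold, rng)] += 1
        for i in range(n):
            if queues[i] and rng.random() < service_p:
                queues[i] -= 1
        area += sum(queues)
    return area / steps
```

With a fixed finite threshold, queue lengths pile up around the threshold as the load intensity approaches one, which is the intuition behind the necessary condition that the threshold must grow.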

Load balancing plays a crucial role in achieving low latency in large distributed systems. Recent load balancing strategies often rely on replication or use placeholders to further improve latency. However, assessing the performance and stability of these strategies is challenging and is therefore often simulation-based. In this paper, we introduce a unified approach to analyze the performance and stability of a broad class of workload-dependent load balancing strategies. This class includes many replication policies, such as replicate below threshold, delayed replication and replicate only small jobs, as well as strategies for fork-join systems.

We consider systems with general job size distributions where jobs may experience server slowdown. We show that the equilibrium workload distribution of the cavity process satisfies a functional differential equation and conjecture that the cavity process captures the limiting behavior of the system as its size tends to infinity.

We study this functional differential equation in more detail for a variety of load balancing policies and propose a numerical method to solve it. The numerical method relies on a fixed point iteration or a simple Euler iteration depending on the type of functional differential equation involved. We further show that additional simplifications can be made if certain distributions are assumed to be phase-type.
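As a minimal illustration of the Euler-iteration idea for functional differential equations, the following solves the simple delay differential equation x'(t) = -x(t - 1); this stand-in equation is chosen for illustration and is not the paper's workload equation:

```python
def euler_dde(step=0.001, horizon=5.0):
    """Forward-Euler iteration for the delay differential equation
    x'(t) = -x(t - 1) with history x(t) = 1 for t in [-1, 0].
    Returns samples x[k] ~ x(-1 + k * step)."""
    delay = int(round(1.0 / step))   # the delay measured in Euler steps
    x = [1.0] * (delay + 1)          # the history segment on [-1, 0]
    for _ in range(int(round(horizon / step))):
        now = len(x) - 1
        # x(t + h) = x(t) + h * x'(t), with x'(t) = -x(t - 1).
        x.append(x[now] - step * x[now - delay])
    return x
```

On [0, 1] the delayed term is the constant history, so the exact solution there is x(t) = 1 - t, which the iteration reproduces; for the fixed-point variant mentioned above, one would instead iterate the whole trajectory to convergence rather than step forward once.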

Various numerical examples are included that validate the numerical method and illustrate its strength and flexibility.

Load balancing systems, comprising a central dispatcher and a scheduling policy at each server, are widely used in practice, and their response time has been extensively studied in the theoretical literature. While much is known about the scenario where the scheduling at the servers is First-Come-First-Served (FCFS), to minimize mean response time we must use Shortest-Remaining-Processing-Time (SRPT) scheduling at the servers. Much less is known about dispatching policies when SRPT scheduling is used. Unfortunately, traditional dispatching policies that are used in practice in systems with FCFS servers often have poor performance in systems with SRPT servers. In this paper, we devise a simple fix that can be applied to any dispatching policy. This fix, called guardrails, ensures that the dispatching policy yields optimal mean response time under heavy traffic when used in a system with SRPT servers. Any dispatching policy, when augmented with guardrails, becomes heavy-traffic optimal. Our results yield the first analytical bounds on mean response time for load balancing systems with SRPT scheduling at the servers.
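The guardrails mechanism itself is not detailed above, but SRPT scheduling at a single server is standard and can be sketched with a heap keyed by remaining work (class and method names here are illustrative):

```python
import heapq

class SRPTServer:
    """Minimal sketch of Shortest-Remaining-Processing-Time scheduling
    at one server: always serve the job with the least remaining work,
    preempting whenever a shorter job is present."""

    def __init__(self):
        self.heap = []  # entries are (remaining_work, job_id)

    def arrive(self, job_id, size):
        heapq.heappush(self.heap, (size, job_id))

    def run(self, t):
        """Serve for t units of work; return ids of completed jobs."""
        done = []
        while self.heap and t > 0:
            rem, jid = heapq.heappop(self.heap)
            if rem <= t:
                t -= rem
                done.append(jid)       # job finishes within the budget
            else:
                heapq.heappush(self.heap, (rem - t, jid))
                t = 0                  # budget exhausted mid-job
        return done
```

A dispatcher sits in front of many such servers; the guardrails idea constrains which server each job may be sent to so that SRPT's advantage survives the dispatching step.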

In multi-server systems, a classical job assignment algorithm works as follows: at the arrival of each job, pick d servers independently and uniformly at random and send the job to the least loaded server among them. This model is known as the power-of-d choices algorithm. In this paper, we analyze a variant of this algorithm in which the d servers are sampled through d independent non-backtracking random walks on a k-regular graph. The random walkers are periodically reset to independent, uniformly random positions. Under some assumptions on the underlying graph, we show that the system dynamics under this new algorithm converge to the solution of a deterministic ordinary differential equation (ODE), the same ODE as for the classical power-of-d choices. We also show that the new algorithm stabilizes the system, and that the stationary distribution of the system converges to the stationary solution of the ODE. The new scheme can be considered a derandomized version of power-of-d choices, as it reduces the use of randomness while maintaining the performance of power-of-d choices.
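The sampling step can be sketched directly; the snippet below starts each walk at a fresh uniform position, standing in for the periodically reset walkers of the actual scheme:

```python
import random

def nb_walk(adj, start, length, rng):
    """A non-backtracking random walk: the next hop is uniform over the
    current node's neighbors excluding the one we just came from."""
    prev, cur = None, start
    for _ in range(length):
        choices = [v for v in adj[cur] if v != prev]
        if not choices:          # dead end; cannot happen on k-regular, k >= 2
            choices = adj[cur]
        prev, cur = cur, rng.choice(choices)
    return cur

def power_of_d_nb(adj, queues, d, walk_len, rng):
    """Sketch of the sampling variant described above: draw d candidate
    servers via d independent non-backtracking walks, then dispatch the
    job to the least loaded candidate."""
    n = len(adj)
    candidates = [nb_walk(adj, rng.randrange(n), walk_len, rng)
                  for _ in range(d)]
    return min(candidates, key=lambda i: queues[i])
```

On a cycle (2-regular), a non-backtracking walk never reverses direction, which makes its endpoint deterministic up to the initial direction, a simple instance of the reduced randomness the scheme exploits.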

This paper focuses on network resilience to perturbations of edge weights. Beyond connectivity, many network applications nowadays rely on some measure of network distance between a pair of connected nodes. In these systems, a metric related to network functionality is associated with each edge, and a pair of nodes is functional only if the weighted shortest-path distance between them is below a given threshold T. Consequently, a natural question is: to what degree can changes in edge weights damage the network's functionality? With this motivation, we study a new problem, Quality of Service Degradation: given a set of node pairs, find a minimum-budget increase of edge weights that ensures the distance between each pair exceeds T. We introduce four algorithms with theoretical performance guarantees for this problem. Each has its own strengths in the trade-off between effectiveness and running time, which we illustrate both in theory and in a comprehensive experimental evaluation.
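The paper's four algorithms are not described above; as a purely hypothetical baseline that illustrates the problem setup, a greedy can repeatedly inflate the edges of a violating pair's current shortest path until every pair's distance exceeds T:

```python
import heapq

def shortest_path(adj, w, s, t):
    """Dijkstra with parent pointers; returns (path, distance).
    adj maps node -> neighbor list; w maps (u, v) -> weight (undirected,
    so both orderings are present)."""
    dist, parent = {s: 0}, {}
    pq = [(0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + w[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], parent[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if t not in dist:
        return None, float("inf")
    path, u = [t], t
    while u != s:
        u = parent[u]
        path.append(u)
    return path[::-1], dist[t]

def degrade(adj, w, pairs, T, unit=1):
    """Illustrative greedy baseline (NOT one of the paper's algorithms):
    while some pair is still within distance T, raise every edge weight
    on one of its current shortest paths by `unit`. Since increases never
    shrink any distance, one pass over the pairs suffices.
    Returns the total weight increase (the budget) and the new weights."""
    w = dict(w)
    budget = 0
    for s, t in pairs:
        path, d = shortest_path(adj, w, s, t)
        while d <= T:
            for u, v in zip(path, path[1:]):
                w[(u, v)] += unit
                w[(v, u)] += unit
                budget += unit
            path, d = shortest_path(adj, w, s, t)
    return budget, w
```

This greedy carries no approximation guarantee; it only makes the budget-versus-distance trade-off of the problem concrete.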

Streaming data analytics has been an important topic of research in recent years. Large quantities of data are generated continuously over time across a variety of application domains such as web and social analytics, scientific computing and energy analytics. One of the key requirements in modern data analytics services is the real-time analysis of these data streams to extract useful and timely information for the analyst. Several distributed data analytics platforms have been developed in recent times to meet this growing requirement of real-time streaming analytics. Nowadays, a large amount of data is generated continuously by geographically distributed sources (e.g., agents, sensors, mobile devices, edge nodes, etc.) in many streaming applications. For instance, services like Facebook, Twitter and Netflix continuously gather data from the end users for a variety of analytical purposes such as finding the popular web content amongst their users or monitoring the QoS metrics. Large content delivery networks (CDNs) like Akamai that serve a significant fraction of content on the Internet continuously collect data from their edge servers and clients from around the globe to understand what, where and how content is accessed for the purpose of providing content analytics insights to businesses.

Is it possible to design a packet-sampling algorithm that prevents the network node that performs the sampling from treating the sampled packets preferentially? We study this problem in the context of designing a "network-transparency" system. In this system, networks emit receipts for a small sample of the packets they observe, and a monitor collects these receipts to estimate each network's loss and delay performance. Sampling is a good building block for this system, because it enables a solution that is flexible and combines low resource cost with quantifiable accuracy. The challenge is cheating resistance: when a network's performance is assessed based on the conditions experienced by a small traffic sample, the network has a strong incentive to treat the sampled packets better than the rest. We contribute a sampling algorithm that is provably robust to such prioritization attacks, enables network performance estimation with quantifiable accuracy, and requires minimal resources. We confirm our analysis using real traffic traces.

BGP is the de-facto Internet routing protocol for interconnecting Autonomous Systems (AS). Each AS selects its preferred routes based on its routing policies, which are typically not disclosed. Due to the distributed route selection and information hiding, answering questions such as "what is the expected catchment of the anycast sites of a content provider at the AS-level, if new sites are deployed?", or "how will load-balancing behave if an ISP changes its routing policy for a prefix?", is a hard challenge. In this work, we propose a framework and methodology to infer the routing behavior in existing or hypothetical routing configurations, and provide new capabilities and insights for informative route inference (e.g., isolating the effect of randomness that is present in prior simulation-based approaches). The proposed framework can be useful in a number of applications: measurements/monitoring, traffic engineering, network planning, Internet routing models, etc.

Emerging non-volatile memory (NVM) technologies offer greater capacity than DRAM. Unfortunately, production NVM exhibits high latency and low write endurance. Hybrid memory combines DRAM and NVM to deliver greater capacity, low latency, high endurance, and low energy consumption. Write-rationing garbage collection mitigates NVM wear-out by placing highly-written objects in DRAM and the rest in NVM. Existing write-rationing garbage collectors dynamically monitor object writes to place highly written objects in DRAM. Unfortunately, monitoring writes incurs a non-negligible performance overhead. This work proposes Crystal Gazer, profile-driven write-rationing garbage collection for hybrid memories. Allocation sites are statically profiled, and highly written objects are predicted based on previous program executions. Unlike prior work, this paper exposes a Pareto trade-off between DRAM usage and NVM lifetime. Experimental results on an emulation platform show that Crystal Gazer eliminates the performance overhead of dynamic monitoring, while reducing more NVM writes than state-of-the-art write-rationing garbage collectors.

Observing that many application programs from different domains can live with less-than-perfect accuracy, existing techniques try to trade off program output accuracy with performance-energy savings. While these works provide point solutions, they leave three critical questions regarding approximate computing unanswered: (i) what is the maximum potential of skipping (i.e., not performing) data accesses under a given inaccuracy bound?; (ii) can we identify the data accesses to drop randomly, or is being architecture aware critical?; and (iii) do two executions that skip the same number of data accesses always result in the same output quality (error)? This paper first provides answers to these questions using ten multithreaded workloads, and then presents a program slicing-based approach that identifies the set of data accesses to drop. Results indicate 8.8% performance improvement and 13.7% energy saving are possible when we set the error bound to 2%, and the corresponding improvements jump to 15% and 25%, respectively, when the error bound is raised to 4%.

Dynamic parallelism (DP) is a new feature of emerging GPUs that allows new kernels to be generated and scheduled from the device-side (GPU) without the host-side (CPU) intervention. To efficiently support DP, one of the major challenges is to saturate the GPU processing elements and provide them with the required data in a timely fashion. In this paper, we first conduct a limit study on the performance improvements that can be achieved by hardware schedulers that are provided with accurate data reuse information. We next propose LASER, a Locality-Aware SchedulER, where the hardware schedulers employ data reuse monitors to help make scheduling decisions to improve data locality at runtime. Experimental results on 16 benchmarks show that LASER, on an average, can improve performance by 11.3%.

The cost of moving data between compute elements and storage elements plays a significant role in shaping the overall performance of applications. We present a compiler-driven approach to reducing data movement costs. Our approach, referred to as Computing with Near Data (CND), is built upon a concept called "recomputation", in which a costly data access is replaced by a few less costly data accesses plus some extra computation, if the cumulative cost of the latter is less than that of the costly data access. Experimental results reveal that i) the average recomputability across our benchmarks is 51.1%, ii) our compiler-driven strategy is able to exploit 79.3% of the recomputation opportunities presented by our workloads, and iii) our enhancements increase the value of the recomputability metric significantly.

We study dynamic pricing of resources that are distributed over a network of locations (e.g., shared vehicle systems and logistics networks). Customers with private willingness-to-pay sequentially request to relocate a resource from one location to another. We focus on networks with a hub-and-spoke structure. We develop a dynamic pricing policy and a performance bound based on a Lagrangian relaxation. This relaxation decomposes the problem over spokes and is thus far easier to solve than the original problem. We analyze the performance of the Lagrangian-based policy in a large network regime in which the number of spokes (n) and the number of resources grow at the same rate. We show that our policy loses no more than $O(\sqrt{\ln n / n})$ in performance compared to an optimal policy, implying asymptotic optimality as n grows large. We provide examples showing that upper bounds and static policies based on fluid relaxations fail to work well in this asymptotic regime. Finally, we discuss how our approach extends to more general networks involving multiple, interconnected hubs.

A core tension in the operations of online marketplaces is between segmentation (wherein platforms can increase revenue by segmenting the market into ever smaller sub-markets) and thickness (wherein the size of the sub-market affects the utility experienced by an agent). An important example of this is in dynamic online marketplaces, where buyers and sellers, in addition to preferences for different matches, also have finite patience (or deadlines) for being matched. We formalize this trade-off via a novel optimization problem that we term 'Two-sided Facility Location': we consider a market wherein agents arrive at nodes embedded in an underlying metric space, where the distance between a buyer and seller captures the quality of the corresponding match. The platform posts prices and wages at the nodes, and opens a set of virtual clearinghouses where agents are routed for matching. To ensure high match quality, the platform imposes a distance constraint between an agent and its clearinghouse; to ensure thickness, the platform requires that the flow to any clearinghouse be at least a pre-specified lower bound. Subject to these constraints, the goal of the platform is to maximize the social surplus subject to weak budget balance, i.e., profit being non-negative. Our work characterizes the complexity of this problem by providing both hardness results as well as algorithms for this setting; in particular, we present an algorithm that for any constant ε > 0 yields a (1 + ε) approximation for the gains from trade, while relaxing the match quality (i.e., maximum distance of any match) by a constant factor.

Although using look-ahead information is known to improve the competitive ratios of online convex optimization (OCO) problems with switching costs, the competitive ratios obtained from existing results often depend on the cost coefficients of the problem, and can potentially be large. In this paper, we propose new online algorithms that can utilize look-ahead to achieve much lower competitive ratios for OCO problems with switching costs and hard constraints. For the perfect look-ahead case where the algorithm is provided with the exact inputs in a future look-ahead window of size K, we design an Averaging Regularized Moving Horizon Control (ARMHC) algorithm that can achieve a competitive ratio of (K+1)/K. To the best of our knowledge, ARMHC is the first to attain a low competitive ratio that is independent of the coefficients of the switching and service costs and of the upper and lower bounds of the inputs. Then, for the case when the future look-ahead has errors, we develop a Weighting Regularized Moving Horizon Control (WRMHC) algorithm that carefully weights the decisions inside the look-ahead window based on the accuracy of the look-ahead information. As a result, WRMHC also achieves a low competitive ratio that is independent of the cost coefficients, even with uncertain hard constraints. Finally, our analysis extends online primal-dual analysis to the case with look-ahead by introducing a novel "re-stitching" idea, which is of independent interest.

This paper studies online optimization under inventory (budget) constraints. While online optimization is a well-studied topic, versions with inventory constraints have proven difficult. We consider a formulation of inventory-constrained optimization that is a generalization of the classic one-way trading problem and has a wide range of applications. We present a new algorithmic framework, CR-Pursuit, and prove that it achieves the optimal competitive ratio among all deterministic algorithms (up to a problem-dependent constant factor) for inventory-constrained online optimization. Our algorithm and its analysis not only simplify and unify the state-of-the-art results for the standard one-way trading problem, but they also establish novel bounds for generalizations including concave revenue functions. For example, for one-way trading with price elasticity, CR-Pursuit achieves a competitive ratio within a small additive constant (i.e., 1/3) to the lower bound of ln θ + 1, where θ is the ratio between the maximum and minimum base prices.
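The CR-Pursuit principle can be illustrated on the classic one-way trading problem: target the ratio π = ln θ + 1 and, after each price arrival, sell just enough inventory to keep realized revenue at (offline optimum so far)/π. The interface and names below are illustrative, not from the paper:

```python
import math

def one_way_trading(prices, p_min, p_max):
    """Sketch of the CR-Pursuit idea for basic one-way trading: sell one
    unit of divisible inventory at prices arriving in [p_min, p_max].
    The offline optimum so far is the best price seen; we keep revenue
    equal to that optimum divided by pi = ln(theta) + 1."""
    theta = p_max / p_min
    pi = math.log(theta) + 1
    best, revenue, inventory = p_min, 0.0, 1.0
    for p in prices:
        if p > best:                        # offline optimum improves
            sell = (p / pi - revenue) / p   # just enough to reach OPT/pi
            sell = min(sell, inventory)
            inventory -= sell
            revenue += sell * p
            best = p
    return revenue, inventory, pi
```

The standard analysis shows the total amount sold never exceeds one unit, so the pursuit target is always feasible and the final revenue equals (max price)/π, i.e., the algorithm is π-competitive.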

Vehicle service providers can display commercial ads in their vehicles based on passengers' origins and destinations to create a new revenue stream. We study a vehicle service provider who can generate different ad revenues when displaying ads on different arcs (i.e., origin-destination pairs). The provider needs to ensure the vehicle flow balance at each location, which makes it challenging to analyze the provider's vehicle assignment and pricing decisions for different arcs. To tackle the problem, we show that certain properties of the traffic network can be captured by a corresponding electrical network. When the effective resistance between two locations is small, there are many paths between the two locations and the provider can easily route vehicles between them. We derive the provider's optimal vehicle assignment and pricing decisions based on effective resistances.
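Effective resistance is a standard graph quantity and can be computed directly from the graph Laplacian; the stdlib-only sketch below treats every edge as a 1-ohm resistor (edge weights and the paper's exact pricing use are not shown):

```python
def effective_resistance(n, edges, s, t):
    """Effective resistance between s and t, treating each edge as a
    1-ohm resistor: inject a unit current at s, extract it at t, ground
    node 0, solve the reduced Laplacian system by Gauss-Jordan
    elimination, and read off the voltage difference."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1.0
        L[v][v] += 1.0
        L[u][v] -= 1.0
        L[v][u] -= 1.0
    b = [0.0] * n
    b[s], b[t] = 1.0, -1.0
    idx = [i for i in range(n) if i != 0]   # ground node 0 (v_0 = 0)
    m = n - 1
    A = [[L[i][j] for j in idx] + [b[i]] for i in idx]
    for c in range(m):                       # Gauss-Jordan with pivoting
        piv = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(m):
            if r != c and A[r][c] != 0.0:
                f = A[r][c] / A[c][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    volt = {0: 0.0}
    for k, i in enumerate(idx):
        volt[i] = A[k][m] / A[k][k]
    return volt[s] - volt[t]
```

For example, on a triangle the direct edge in parallel with the two-hop path gives 1 ∥ 2 = 2/3: a small effective resistance reflecting the multiple routes between the two locations, which is exactly the routing-flexibility intuition described above.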

Many systems, such as the Internet, social networks, and the power grid, can be represented as graphs. When analyzing graphs, it is often useful to compute scores describing the relative importance or distance between nodes. One example is Personalized PageRank (PPR), which assigns to each node v a vector whose i-th entry describes the importance of the i-th node from the perspective of v. PPR has proven useful in many applications, such as recommending who users should follow on social networks (if this i-th entry is large, v may be interested in following the i-th user). Unfortunately, computing n PPR vectors exactly for a graph of n nodes has complexity O(n^3), which is infeasible for many graphs of interest.
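For a small graph, a single PPR vector is easy to compute by power iteration; α below is the usual restart (teleport) parameter, and the paper's exact PPR parameterization may differ:

```python
def ppr_vector(adj, v, alpha=0.15, iters=100):
    """Personalized PageRank of node v by power iteration on
    pi = alpha * e_v + (1 - alpha) * pi @ P, where P follows a uniform
    random out-edge and dangling nodes restart at v."""
    n = len(adj)
    x = [0.0] * n
    x[v] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[v] = alpha                 # restart mass always returns to v
        for u in range(n):
            if adj[u]:
                share = (1 - alpha) * x[u] / len(adj[u])
                for nb in adj[u]:
                    nxt[nb] += share   # spread mass along out-edges
            else:
                nxt[v] += (1 - alpha) * x[u]   # dangling node restarts
        x = nxt
    return x
```

Repeating this once per source node is what drives the O(n^3)-type cost mentioned above; the error contracts by a factor (1 - α) per iteration.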

In this work, we devise a scheme to estimate all n PPR vectors with bounded l_1 error and complexity O(n^c), where c < 2 depends on the degrees of the graph at hand, the desired error tolerance, and a parameter that defines PPR. This improves upon existing methods, the best of which have complexity O(n^2 log n) in our setting. Our complexity guarantee holds with high probability, for certain choices of the PPR parameter, and for a certain class of random graphs (roughly speaking, the sparse directed configuration model with heavy-tailed in-degrees); our accuracy guarantee holds with probability 1 and for arbitrary graphs and PPR parameters. The complexity result arises as a consequence of our main (structural) result, which shows that the dimensionality of the set of PPR vectors scales sublinearly in n with high probability, for the same class of random graphs and for a notion of dimensionality similar to matrix rank. It is this coupling of the PPR vectors for the nodes on a common underlying graph that allows for estimating them faster. Hence, at a high level, our scheme is analogous to (but distinct from) low-rank matrix approximation. We also note that our scheme is similar to one that was proposed in [Jeh and Widom 2003] but lacked accuracy and complexity guarantees, so another contribution of our paper is to address this gap in the literature.

We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random and possibly heavy-tailed completion time, and a reward is obtained only after the task is completed. The task types are independent of each other, and have distinct and unknown distributions for completion time and reward. For a given time horizon τ, the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. In addition, we allow the controller to interrupt a task and initiate a new one. In addition to the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a new one: should the controller complete the task and get the reward, or interrupt the task for a possibly shorter and more rewarding alternative? We show that for all heavy-tailed and some light-tailed completion time distributions, this interruption mechanism improves the reward linearly over time. From a learning perspective, the interrupt mechanism necessitates implicitly learning statistics beyond the mean from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI based on the median-of-means estimator for possibly heavy-tailed reward and completion time distributions. We show that, in a K-armed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(K log(τ) + KL) regret. We also prove that the regret under any admissible policy is Ω(K log(τ)), which implies that UCB-BwI is order optimal.
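The median-of-means estimator at the heart of UCB-BwI is simple to state: split the samples into groups, average each group, and take the median of the group means. A minimal stdlib sketch:

```python
import statistics

def median_of_means(samples, k):
    """Median-of-means: partition the samples into k groups, average
    each group, and return the median of the group means. Unlike the
    plain empirical mean, a few heavy-tailed outliers cannot drag the
    estimate far, since they corrupt at most a few group means."""
    groups = [samples[i::k] for i in range(k)]
    return statistics.median(statistics.fmean(g) for g in groups)
```

With one extreme outlier among 101 otherwise well-behaved samples, the empirical mean explodes while the median-of-means stays near the bulk of the data, which is exactly the robustness needed for heavy-tailed rewards and completion times.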

Modern networked systems are increasingly reconfigurable, enabling demand-aware infrastructures whose resources can be adjusted according to the workload they currently serve. Such dynamic adjustments can be exploited to improve network utilization and hence performance, by moving frequently interacting communication partners closer, e.g., collocating them in the same server or datacenter. However, dynamically changing the embedding of workloads is algorithmically challenging: communication patterns are often not known ahead of time, but must be learned. During the learning process, overheads related to unnecessary moves (i.e., re-embeddings) should be minimized. This paper studies a fundamental model which captures the tradeoff between the benefits and costs of dynamically collocating communication partners on l servers, in an online manner. Our main contribution is a distributed online algorithm which is asymptotically almost optimal, i.e., almost matches the lower bound (also derived in this paper) on the competitive ratio of any (distributed or centralized) online algorithm.

Container Orchestration Platforms (COPs), such as Kubernetes, are increasingly used to manage large-scale clusters by automating resource allocation between applications encapsulated in containers. Increasingly, the resources underlying COPs are virtual machines (VMs) dynamically acquired from cloud platforms. COPs may choose from many different types of VMs offered by cloud platforms, which differ in their cost, performance, and availability. While transient VMs cost significantly less than on-demand VMs, platforms may revoke them at any time, causing them to become unavailable. While transient VMs' price is attractive, their unreliability is a problem for COPs designed to support mixed workloads composed of, not only delay-tolerant batch jobs, but also long-lived interactive services with high availability requirements.

To address the problem, we design TR-Kubernetes, a COP that optimizes the cost of executing mixed interactive and batch workloads on cloud platforms using transient VMs. To do so, TR-Kubernetes enforces arbitrary availability requirements specified by interactive services despite transient VM unavailability by acquiring many more transient VMs than necessary most of the time, which it then leverages to opportunistically execute batch jobs when excess resources are available. When cloud platforms revoke transient VMs, TR-Kubernetes relies on existing Kubernetes functions to internally revoke resources from batch jobs to maintain interactive services' availability requirements. We show that TR-Kubernetes requires minimal extensions to Kubernetes, and is capable of lowering the cost (by 53%) and improving the availability (99.999%) of a representative interactive/batch workload on Amazon EC2 when using transient compared to on-demand VMs.

Several different control methods are used in practice or have been proposed to cost-effectively provision IT resources. Since many control methods depend on accurate predictions of the future to make good provisioning decisions, there has been a great deal of literature on predicting workload demand. However, even with all of this literature on workload predictions and their utilization in control algorithms, the understanding of prediction error and how to handle it remains an important open issue and research challenge.

In this paper we aim to bridge this gap by making the following contributions: (i) Prediction error is modeled to aid in proving worst-case dynamic regret bounds for control algorithms. (ii) Upper bounds on dynamic regret are proven for a variety of algorithms in terms of the prediction error model. In order to choose which algorithm to run without knowledge of the prediction error, a simple online meta-algorithm is designed. (iii) A detailed analysis of prediction accuracy is done for cloud computing by fitting real-world CPU utilization traces of Azure virtual machines to popular prediction models. (iv) Using real-world trace based simulations of CPU allocation for virtual machines, the proposed meta-algorithm is shown to outperform a popular algorithm selection policy and perform very closely to the best algorithm chosen in hindsight.

The Least Recently Used (LRU) caching and its variants are used in large-scale data systems in order to provide high-speed data access for a wide class of applications. Nonetheless, a fundamental question still remains: in order to minimize the miss probabilities, how should the cache space be organized to serve multiple data flows? Commonly used strategies can be categorized into two designs: pooled LRU (PLRU) caching and separated LRU (SLRU) caching. However, neither of these designs can satisfactorily solve this problem. PLRU caching is easy to implement and self-adaptive, but often fails to achieve optimal or even efficient performance because its set of feasible solutions is limited. SLRU caching can be statically configured to achieve optimal performance for stationary workloads, but it can suffer in a dynamically changing environment and from a cold-start problem. To this end, we propose a new insertion-based pooled LRU paradigm, named I-PLRU, where the data flows can be inserted at different positions of a pooled cache. This new design can achieve the optimal performance of the static SLRU, but retains the adaptability of PLRU for resource sharing. Theoretically, we characterize the asymptotic miss probabilities of I-PLRU, and prove that, for any given SLRU design, there always exists an I-PLRU configuration that achieves the same asymptotic miss probability, and vice versa. We next design a policy to minimize the miss probabilities. However, the miss probability minimization problem turns out to be non-convex under the I-PLRU paradigm. Notably, we utilize an equivalence mapping between I-PLRU and SLRU to efficiently find the optimal I-PLRU configuration. We prove that I-PLRU outperforms PLRU and achieves the same miss probability as the optimal SLRU under a stationary request arrival process.
From an engineering perspective, the flexibility of I-PLRU avoids separating the memory space, supports dynamic and refined configurations, and alleviates the cold-start problem, potentially yielding better performance than both SLRU and PLRU.
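The insertion-position idea can be made concrete with a toy sketch; the class and its interface below are illustrative, not the paper's construction:

```python
class IPLRU:
    """Toy sketch of insertion-based pooled LRU (I-PLRU): one shared
    list of fixed capacity, but an item of flow f is inserted at that
    flow's configured depth (0 = most-recently-used head) rather than
    always at the head. Hits still promote the item to the head, and
    eviction is from the tail, as in ordinary LRU."""

    def __init__(self, capacity, insert_depth):
        self.capacity = capacity
        self.depth = insert_depth        # maps flow -> insertion position
        self.items = []                  # index 0 = MRU end, last = LRU end

    def access(self, flow, key):
        """Returns True on a hit; on a miss, inserts key and returns False."""
        if key in self.items:
            self.items.remove(key)
            self.items.insert(0, key)    # promote to the MRU head on a hit
            return True
        pos = min(self.depth.get(flow, 0), len(self.items))
        self.items.insert(pos, key)      # flow-specific insertion depth
        if len(self.items) > self.capacity:
            self.items.pop()             # evict from the LRU tail
        return False
```

Setting all depths to 0 recovers plain PLRU, while deeper insertion positions shorten a flow's effective residence time, which is the knob that lets I-PLRU mimic any SLRU partition without physically separating the memory.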

Recent years have witnessed a revival of both industrial and academic interest in improving congestion control designs. The quest for better congestion control is complicated by the extreme diversity and range of (i) the design space (as exemplified by the stark conceptual and operational differences between recent proposals~\cite{bbr,vivace,copa}), (ii) the desired properties (ranging from high performance to fairness to TCP-friendliness), (iii) the envisioned operational setting (inter- and intra-datacenter, wireless, the commercial Internet, satellite), and (iv) the application loads and requirements (small vs. large traffic demands, latency- vs. bandwidth-sensitive).

Most congestion control research uses simulation and experiments under a limited range of network conditions. This is extremely important for understanding the detailed performance of particular schemes in specific settings, but provides limited insight into the more general properties of these schemes and no information about the inherent limits of congestion control designs (such as which properties are simultaneously achievable and which are mutually exclusive). In contrast, traditional theoretical approaches typically focus on the design of protocols that achieve specific, predetermined objectives (e.g., network utility maximization~\cite{kelly2009}), or on the analysis of specific protocols (e.g., from control-theoretic perspectives~\cite{paganini2003control}), as opposed to exploring the inherent tensions between desired properties.

We advocate an axiomatic approach to congestion control, which is complementary to the experimental and theoretical work currently being pursued. Our approach, modeled on similar efforts in social choice theory and game theory~\cite{arrow1950difficulty}, specifies a set of requirements ("axioms") and then identifies (i) which subsets of requirements can coexist (i.e., there are designs that achieve all of them) and which subsets cannot be met simultaneously (i.e., no design can simultaneously achieve all of them), and (ii) whether some requirements follow immediately from satisfying others. Thus, the axiomatic approach can shed light on the inherent tradeoffs involved in congestion control protocol design, and can be leveraged to classify existing and proposed solutions according to the properties they satisfy.

The axiomatic approach has been applied to many computer science environments, e.g., reputation systems, recommendation systems, link prediction, and more. To the best of our knowledge, ours is the first application of this approach to congestion control protocols (though \cite{shenker1990theoretical} touches on the subject briefly).

We introduce a simple network model where we can evaluate congestion control designs and formulate several natural axioms (or requirements) for congestion control protocols, including efficient link-utilization, loss-avoidance, fairness, stability, and TCP-friendliness. Congestion control protocols can be regarded as points in a multidimensional space reflecting the extent to which they satisfy these requirements, and we show how classical families of congestion control protocols (e.g., additive-increase-multiplicative-decrease, multiplicative-increase-multiplicative-decrease, and more) can be mapped to points in this space.
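As a toy illustration of why AIMD-style protocols occupy a "fair" point in such a space, the sketch below (an idealized synchronous-feedback model, not the paper's formal framework) shows two AIMD flows sharing a link and converging to an equal share:

```python
def aimd_step(rates, capacity, alpha=1.0, beta=0.5):
    """One synchronous round of additive-increase / multiplicative-decrease.

    If the link is congested (total rate exceeds capacity), every flow
    backs off multiplicatively; otherwise every flow increases additively.
    Illustrative model only: synchronous feedback, no queueing.
    """
    if sum(rates) > capacity:
        return [r * beta for r in rates]
    return [r + alpha for r in rates]


def run_aimd(rates, capacity, rounds=200):
    for _ in range(rounds):
        rates = aimd_step(rates, capacity)
    return rates
```

Starting from very unequal rates, each multiplicative back-off halves the gap between the flows while additive increase preserves it, so the rates converge together; under MIMD (multiplicative increase as well) the ratio between flows is preserved forever and fairness is never reached, one of the tensions an axiomatic treatment can formalize.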

We leverage our axiomatic framework to derive basic results on the feasibility of simultaneously achieving different requirements within a single design. Our results formalize and shed light on various empirical/experimental observations about tensions between different desiderata, including (1) the tension between attaining high performance and being friendly to legacy TCP connections~\cite{lakshman1997performance,padhye1998modeling}, (2) the tension between achieving high bandwidth and maintaining low latency under dynamic environments~\cite{brakmo1995tcp,mo1999analysis,sprout}, and (3) the tension between being robust to non-congestion loss and not incurring high loss upon convergence~\cite{vivace}. From a protocol design perspective, desirable congestion control protocols are those that reside on the Pareto frontier in the multidimensional space induced by our requirements, and this Pareto frontier is characterized by our theoretical results.

To be sure, our axiomatic approach has its limitations, as it revolves around investigating these properties in a simplified model. However, we feel that the results, in terms of which axioms can coexist and which cannot, provide insights that apply far beyond the simple model (even if the detailed theoretical results do not). Thus, we contend that the axiomatic approach is a useful addition to the evaluatory arsenal that researchers should apply to congestion control.

We view our results as a first step and leave the reader with many research directions relating to the extension of our model and the re-examination of our axioms/metrics. Thus far, axiomatic approaches have been applied to a few, very specific, networking environments~\cite{shenker1990theoretical,lev2015axiomatic,cohen2015axiomatic}. We believe that applying the axiomatic approach to other networking contexts (e.g., intradomain~\cite{lev2015axiomatic} and interdomain routing, traffic engineering, in-network queueing~\cite{sivaraman2013no}, network security) could contribute to more principled discussions about these contexts.

We propose a general framework, dubbed Stochastic Processing under Imperfect Information (SPII), to study the impact of information constraints and memory on dynamic resource allocation. The framework involves a Stochastic Processing Network (SPN) scheduling problem in which the decision maker may access the system state only through a noisy channel, and resource allocation decisions must be carried out through the interaction between an encoding policy (which observes the state) and an allocation policy (which chooses the allocation). Applications in the management of large-scale data centers and human-in-the-loop service systems are among our chief motivations.

We quantify the degree to which information constraints reduce the size of the capacity region in general SPNs, and how such reduction depends on the amount of memory available to the encoding and allocation policies. Using a novel metric, the capacity factor, our main theorem characterizes the reduction in capacity region (under "optimal" policies) for all non-degenerate channels, and across almost all combinations of memory sizes. Notably, the theorem demonstrates, in substantial generality, that (1) the presence of a noisy channel always reduces capacity, (2) more memory for the allocation policy always improves capacity, and (3) more memory for the encoding policy has little to no effect on capacity. Finally, all of our positive (achievability) results are established through constructive, implementable policies.

Our proof program involves the development of a host of new techniques by combining ideas from information theory, learning and queueing theory. We create a simple yet powerful generalization of the Max-Weight policy, in which individual Markov chains are selected dynamically, in a manner analogous to how schedules are used in a conventional Max-Weight policy.

When evaluating the impact of a policy (e.g., gun control) on a metric of interest (e.g., crime rate), it may not be feasible to conduct a randomized control trial. In such settings, where only observational data is available, synthetic control (SC) methods \cite{abadie1,abadie2,abadie3} provide a popular data-driven approach to estimate a "synthetic" or "virtual" control by combining measurements of "similar" alternatives or units (called "donors").

Recently, robust synthetic control (RSC) \cite{rsc1} was proposed as a generalization of SC to overcome the challenges of missing data and high levels of noise, while removing the reliance on expert domain knowledge for selecting donors. However, both SC and RSC (and its variants) suffer from poor estimation when the pre-intervention period is too short.

As the main contribution of this work, we propose a generalization of unidimensional RSC to multi-dimensional Robust Synthetic Control, mRSC. Our proposed mechanism, mRSC, incorporates multiple types of measurements (or metrics) in addition to the measurement of interest for estimating a synthetic control, thus overcoming the challenge of poor inference due to limited amounts of pre-intervention data. We show that the mRSC algorithm, when using $K$ relevant metrics, leads to a consistent estimator of the synthetic control for the target unit of interest under any metric. Our finite-sample analysis suggests that the mean-squared error (MSE) of our predictions decays to zero at a rate faster than the RSC algorithm by a factor of $K$ and $\sqrt{K}$ for the training (pre-intervention) and testing (post-intervention) periods, respectively. Additionally, we propose a principled scheme to combine multiple metrics of interest via a diagnostic test that evaluates if adding a metric can be expected to result in improved inference.

Our mechanism for validating mRSC performance is also an important and related contribution of this work: time series prediction. We propose a method to predict the future evolution of a time series based on limited data when the notion of time is relative and not absolute, i.e., where we have access to a donor pool that has already undergone the desired future evolution.

We conduct extensive experimentation to establish the efficacy of mRSC in three different scenarios: predicting the evolution of a metric of interest using synthetically generated data from a known factor model, and forecasting weekly sales and score trajectories of a Walmart store and Cricket game, respectively.

Most congestion control algorithms, like TCP, rely on a reactive control system that detects congestion, then marches carefully towards a desired operating point (e.g. by modifying the window size or adjusting a rate). In an effort to balance stability and convergence speed, they often take hundreds of RTTs to converge; an increasing problem as networks get faster, with less time to react.

This paper is about an alternative class of congestion control algorithms based on proactive scheduling: switches and NICs proactively exchange control messages to run a distributed algorithm that picks "explicit rates" for each flow. We call these Proactive Explicit Rate Control (PERC) algorithms. They take as input the routing matrix and link speeds, but not a congestion signal. By exploiting information such as the number of flows at a link, they can converge an order of magnitude faster than reactive algorithms.

Our main contributions are (1) s-PERC ("stateless" PERC), a new practical distributed PERC algorithm without per-flow state at the switches, and (2) a proof that s-PERC computes exact max-min fair rates in a known bounded time, the first such algorithm to do so without per-flow state. To analyze s-PERC, we introduce a parallel variant of standard waterfilling, 2-Waterfilling. We prove that s-PERC converges to max-min fair in 6N rounds, where N is the number of iterations 2-Waterfilling takes for the same routing matrix.
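For reference, max-min fair rates can be computed centrally by the classic sequential waterfilling algorithm; the sketch below (a standard algorithm with a hypothetical interface) shows the baseline allocation that s-PERC matches distributedly without per-flow state:

```python
def max_min_fair(links, flows):
    """Classic sequential waterfilling for max-min fair rates.

    links: dict mapping link -> capacity
    flows: dict mapping flow  -> list of links it traverses

    Repeatedly find the bottleneck link (the one offering the smallest
    fair share to its remaining flows), freeze those flows at that
    share, subtract their rates from link capacities, and recurse.
    """
    rate = {}
    cap = dict(links)
    active = {f: set(path) for f, path in flows.items()}
    while active:
        # fair share each link could offer its remaining flows
        share = {}
        for link, c in cap.items():
            n = sum(1 for path in active.values() if link in path)
            if n:
                share[link] = c / n
        bottleneck = min(share, key=share.get)
        s = share[bottleneck]
        for f in [f for f, path in active.items() if bottleneck in path]:
            rate[f] = s
            for link in active[f]:
                cap[link] -= s
            del active[f]
    return rate
```

Each iteration resolves one bottleneck, so the algorithm terminates after at most one pass per link; s-PERC achieves the same fixed point in a bounded number of distributed rounds.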

We describe how to make s-PERC practical and robust to deploy in real networks. We confirm using realistic simulations and an FPGA hardware testbed that s-PERC converges 10-100x faster than reactive algorithms like TCP, DCTCP and RCP in data-center networks and 1.3--6x faster in wide-area networks (WANs). Long flows complete in close to the ideal time, while short-lived flows are prioritized, making it appropriate for data-centers and WANs.

Mean field modeling is a popular approach to assess the performance of large scale computer systems. The evolution of many mean field models is characterized by a set of ordinary differential equations that have a unique fixed point. In order to prove that this unique fixed point corresponds to the limit of the stationary measures of the finite systems, the unique fixed point must be a global attractor. While global attraction was established for various systems in the case of exponential job sizes, it is often unclear whether these proof techniques can be generalized to non-exponential job sizes.

In this paper we show how simple monotonicity arguments can be used to prove global attraction for a broad class of ordinary differential equations that capture the evolution of mean field models with hyperexponential job sizes. This class includes both existing as well as previously unstudied load balancing schemes and can be used for systems with either finite or infinite buffers.

The main novelty of the approach lies in using a Coxian representation for the hyperexponential job sizes and a partial order that is stronger than the componentwise partial order used in the exponential case.
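For intuition, a hyperexponential job size is a probabilistic mixture of exponentials. The sketch below (illustrative only, not part of the paper's analysis) samples from one and computes its exact moments, showing a squared coefficient of variation above 1, the property that separates this setting from the exponential case:

```python
import random


def sample_hyperexp(probs, rates, rng=random):
    """Draw one hyperexponential job size: with probability probs[i]
    the job is Exp(rates[i])."""
    u = rng.random()
    acc = 0.0
    for p, mu in zip(probs, rates):
        acc += p
        if u <= acc:
            return rng.expovariate(mu)
    return rng.expovariate(rates[-1])  # guard against rounding


def hyperexp_moments(probs, rates):
    """Exact mean and squared coefficient of variation (SCV).

    A hyperexponential always has SCV >= 1, with equality only in the
    degenerate single-phase (exponential) case.
    """
    m1 = sum(p / mu for p, mu in zip(probs, rates))
    m2 = sum(2.0 * p / mu**2 for p, mu in zip(probs, rates))
    scv = m2 / m1**2 - 1.0
    return m1, scv
```

The Coxian representation used in the paper rewrites such a mixture as a chain of exponential phases, which is what makes the stronger partial order (and hence the monotonicity argument) available.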

Load balancing algorithms play a vital role in enhancing performance in data centers and cloud networks. Due to the massive size of these systems, scalability challenges, and especially the communication overhead associated with load balancing mechanisms, have emerged as major concerns. Motivated by these issues, we introduce and analyze a novel class of load balancing schemes where the various servers provide occasional queue updates to guide the load assignment.

We show that the proposed schemes strongly outperform JSQ(d) strategies with comparable communication overhead per job, and can achieve a vanishing waiting time in the many-server limit with just one message per job, just like the popular JIQ scheme. The proposed schemes are, however, particularly geared towards the sparse feedback regime with fewer than one message per job, where they outperform corresponding sparsified JIQ versions.
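For context, the JSQ(d) baseline referred to above can be sketched in a few lines (an illustrative simulation fragment, not the paper's proposed update-based scheme):

```python
import random


def jsq_d(queues, d, rng=random):
    """Join-the-Shortest-of-d-Queues dispatching.

    Sample d servers uniformly at random and send the arriving job to
    the shortest sampled queue; this costs d queue-length probes per
    job, which is the communication overhead the update-based schemes
    aim to undercut.  Returns the index of the chosen server.
    """
    sampled = rng.sample(range(len(queues)), d)
    target = min(sampled, key=lambda i: queues[i])
    queues[target] += 1
    return target
```

With d equal to the number of servers this degenerates to plain JSQ; with d = 1 it is random assignment, bracketing the overhead/performance trade-off.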

We investigate fluid limits for synchronous updates as well as asynchronous exponential update intervals. The fixed point of the fluid limit is identified in the latter case, and used to derive the queue length distribution. We also demonstrate that in the ultra-low feedback regime the mean stationary waiting time tends to a constant in the synchronous case, but grows without bound in the asynchronous case.

A simple bound in GI/G/1 queues was obtained by Kingman using a discrete martingale transform. We extend this technique to 1) multiclass ΣGI/G/1 queues and 2) Markov Additive Processes (MAPs) whose background processes can be time-inhomogeneous or have an uncountable state-space. Both extensions are facilitated by a necessary and sufficient ordinary differential equation (ODE) condition for MAPs to admit continuous martingale transforms. Simulations show that the bounds on waiting time distributions are almost exact in heavy-traffic, including the cases of 1) heterogeneous input, e.g., mixing Weibull and Erlang-k classes and 2) Generalized Markovian Arrival Processes, a new class extending the Batch Markovian Arrival Processes to continuous batch sizes.
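For reference, Kingman's classical bound on the *mean* waiting time in a GI/G/1 queue, which the martingale-based distributional bounds in the paper refine, can be computed directly (a standard formula; the helper's interface is our own):

```python
def kingman_mean_wait(lam, mu, scv_a, scv_s):
    """Kingman's upper bound on the mean waiting time in a GI/G/1 queue:

        E[W] <= rho/(1-rho) * (c_a^2 + c_s^2)/2 * (1/mu)

    lam:   arrival rate
    mu:    service rate
    scv_a: squared coefficient of variation of interarrival times (c_a^2)
    scv_s: squared coefficient of variation of service times (c_s^2)

    For M/M/1 (both SCVs equal to 1) the bound is tight and equals the
    exact mean waiting time rho/(mu - lam).
    """
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("queue unstable: rho >= 1")
    return rho / (1.0 - rho) * (scv_a + scv_s) / 2.0 * (1.0 / mu)
```

The paper's extensions replace this mean-value statement with exponential bounds on the full waiting-time distribution, via martingale transforms of the underlying Markov Additive Process.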

We derive simple bounds on the queue distribution in finite-buffer queues with Markovian arrivals. The bounds capture a truncated exponential behavior, involving joint horizontal and vertical shifts of an exponential function; this is fundamentally different than existing results capturing horizontal shifts only. We also obtain similar bounds on the loss distribution, which is a key metric to understand the impact of finite-buffer queues on real-time applications. Simulations show that the bounds are accurate in heavy-traffic regimes, and improve existing ones by orders of magnitude.

This paper studies the scaling of the expected total queue size in an $n\times n$ input-queued switch, as a function of both the load $\rho$ and the system scale $n$. We provide a new class of scheduling policies under which the expected total queue size scales as $O\left(n(1-\rho)^{-4/3}\log\left(\max\left\{\frac{1}{1-\rho}, n\right\}\right)\right)$, over all $n$ and $\rho<1$, when the arrival rates are uniform. This improves over the previously best-known scalings in two regimes: $O\left(n^{1.5}(1-\rho)^{-1}\log\frac{1}{1-\rho}\right)$ when $\Omega(n^{-1.5}) \le 1-\rho \le O(n^{-1})$, and $O\left(\frac{n\log n}{(1-\rho)^{2}}\right)$ when $1-\rho \ge \Omega(n^{-1})$. A key ingredient in our method is a tight characterization of the largest $k$-factor of a random bipartite multigraph, which may be of independent interest.

In this paper, we examine the design tradeoffs of existing in-memory data structures of a state-of-the-art key-value store. We observe that no data structure provides both fast point-accesses and consistent ranged-retrievals, and naive amalgamations of existing structures fail to get the best of both worlds. Furthermore, our experiments reveal a performance anomaly when increasing the memory size: as more key-value pairs are maintained in memory, the shortcomings of the data structures are exacerbated. To address the above problems, we present TeksDB, a fast and consistent key-value store with a novel in-memory data structure, which efficiently handles both point and ranged accesses at a modest increase in memory footprint. Our evaluation demonstrates that TeksDB outperforms RocksDB by 3.6×, 9×, and 4.5× for get, scan, and range_query, respectively. The effectiveness of TeksDB extends to real-world workloads, achieving up to 3.3× speedup for YCSB.

Application performance on novel memory systems is typically estimated using a hardware simulator. The simulation is, however, time consuming, which limits the number of design options that can be explored within a practical length of time. Also, although memory simulators are typically well validated, current CPU simulators have various shortcomings, such as simplified out-of-order execution, an obsolete data prefetcher, and a lack of virtual-to-physical memory translation, all of which can cause large discrepancies between the simulated and the actual memory system behavior.

Virtualization is becoming increasingly common in data centers due to its various advantages. However, how to choose among different platforms, including both software and hardware, is a considerable challenge. In this context, evaluating the virtualization capabilities of different platforms is critically important. Regrettably, existing benchmarks are not adequate for this purpose. Different hardware mechanisms and hypervisor designs introduce many different hypervisor-level events, such as transitions between VMs and the hypervisor, two-dimensional page walks, and binary translation. These events are key factors affecting virtualization performance. Existing benchmarks either overlook these events or are tightly coupled to a particular hypervisor.

In this paper, we present HyperBench, a benchmark suite that focuses on the capabilities of different virtualization platforms. Currently, we design 15 hypervisor benchmarks covering CPU, memory, and I/O. The virtualization-sensitive operation in each benchmark triggers hypervisor-level events, which examine the platform's ability in the target area. HyperBench is designed as a custom kernel that can adapt to different hypervisors and architectures. Moreover, adding a new benchmark is straightforward. Finally, we perform a series of experiments on the host machine and on several popular hypervisors, such as QEMU, KVM, and Xen, demonstrating that HyperBench is capable of revealing the performance implications of the hardware mechanism and hypervisor design.

Customizability is a key feature of the Android operating system that differentiates it from Apple's iOS. One concrete feature that is gaining popularity is called "app virtualization". This feature allows multiple copies of the same app to be installed and opened simultaneously (e.g., with multiple accounts logged in). Virtualization frameworks are used by more than 100 million users worldwide. As with any new system feature, we are interested in two aspects: (1) whether the feature itself introduces security risks and (2) whether the feature is abused for unintended purposes. This paper conducts a systematic study of these two aspects of app virtualization techniques. Through a thorough study of 32 popular virtualization frameworks from Google Play, we identify seven areas of potential attack vectors and find that most of the frameworks are susceptible to them. By deeply investigating their ecosystem, we show, with demonstrations, that attackers can easily distribute malware that takes advantage of these attack vectors. In addition, we show that the same virtualization techniques are also abused by malware as an alternative and easy-to-use repackaging mechanism. To this end, we design and implement a new app repackage detector. After scanning 250,145 apps from app markets, it finds 164 repackaged apps that attempt to steal user credentials and private data.

Intel SGX has attracted much attention from academia and is already powering commercial applications. Cloud providers have also started implementing SGX in their cloud offerings. Research efforts on Intel SGX so far have mainly focused on its security and programmability aspects. However, no work has studied in detail the performance degradation caused by SGX in virtualized systems. Such settings are particularly important, considering that virtualization is the de facto building block of cloud infrastructure, yet often comes with a performance impact. This paper presents the first detailed performance analysis of Intel SGX in a virtualized system compared against bare metal. Based on our findings, we identify several optimization strategies that would result in performance improvements of Intel SGX on such systems.

This paper combines data-driven and model-driven methods for real-time misinformation detection. Our algorithm, named QuickStop, is an optimal stopping algorithm based on a probabilistic information spreading model obtained from labeled data. The algorithm consists of an offline machine learning algorithm for learning the probabilistic information spreading model and an online optimal stopping algorithm to detect misinformation. The online detection algorithm has both low computational and memory complexities. Our numerical evaluations with a real-world dataset show that QuickStop outperforms existing misinformation detection algorithms in terms of both accuracy and detection time (number of observations needed for detection). Our evaluations with synthetic data further show that QuickStop is robust to (offline) learning errors.

Motivated by the success of using black-box predictive algorithms as subroutines for online decision-making, we develop a new framework for designing online policies given access to an oracle providing statistical information about an offline benchmark. Having access to such prediction oracles enables simple and natural Bayesian selection policies, and raises the question as to how these policies perform in different settings. Our work makes two important contributions towards tackling this question: First, we develop a general technique we call *compensated coupling* which can be used to derive bounds on the expected regret (i.e., additive loss with respect to a benchmark) for any online policy and offline benchmark; Second, using this technique, we show that the Bayes Selector has constant expected regret (i.e., independent of the number of arrivals and resource levels) in any online packing and matching problem with a finite type-space. Our results generalize and simplify many existing results for online packing and matching problems, and suggest a promising pathway for obtaining oracle-driven policies for other online decision-making settings.

We consider unreliable distributed learning systems wherein the training data is kept confidential by external workers, and the learner has to interact closely with those workers to train a model. In particular, we assume that there exists a system adversary that can adaptively compromise some workers; the compromised workers deviate from their prescribed local specifications by sending out arbitrarily malicious messages.

We assume in each communication round, up to $q$ out of the $m$ workers suffer Byzantine faults. Each worker keeps a local sample of size $n$ and the total sample size is $N=nm$. We propose a secured variant of the gradient descent method that can tolerate up to a constant fraction of Byzantine workers, i.e., $q/m = O(1)$. Moreover, we show the statistical estimation error of the iterates converges in $O(\log N)$ rounds to $O(\sqrt{q/N} + \sqrt{d/N})$, where $d$ is the model dimension. As long as $q=O(d)$, our proposed algorithm achieves the optimal error rate $O(\sqrt{d/N})$. Our results are obtained under some technical assumptions. Specifically, we assume strongly-convex population risk. Nevertheless, the empirical risk (sample version) is allowed to be non-convex. The core of our method is to robustly aggregate the gradients computed by the workers based on the filtering procedure proposed by Steinhardt et al. \cite{Steinhardt18}. On the technical front, deviating from the existing literature on robustly estimating a finite-dimensional mean vector, we establish a \emph{uniform concentration} of the sample covariance matrix of gradients, and show that the aggregated gradient, as a function of model parameter, converges uniformly to the true gradient function. To get a near-optimal uniform concentration bound, we develop a new matrix concentration inequality, which might be of independent interest.
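To convey the flavor of robust gradient aggregation, the sketch below uses coordinate-wise median, a common Byzantine-robust aggregator. Note this is a simplified stand-in for illustration, not the filtering procedure of Steinhardt et al. that the paper actually employs:

```python
def coordinate_median(gradients):
    """Aggregate worker gradients by taking the median per coordinate.

    A simplified Byzantine-robust stand-in: as long as fewer than half
    of the workers are compromised, each coordinate's median lies
    within the range of the honest workers' reports, so an adversary
    cannot drag the aggregate arbitrarily far.

    gradients: list of equal-length lists, one per worker.
    """
    def median(xs):
        s = sorted(xs)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

    dims = len(gradients[0])
    return [median([g[d] for g in gradients]) for d in range(dims)]
```

A single Byzantine worker reporting an arbitrarily large gradient shifts the median only to an adjacent honest value, whereas it would corrupt a plain average completely.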

We propose an algorithm to impute and forecast a time series by transforming the observed time series into a matrix, utilizing matrix estimation to recover missing values and de-noise observed entries, and performing linear regression to make predictions. At the core of our analysis is a representation result, which states that for a large class of models, the transformed time series matrix is (approximately) low-rank. In effect, this generalizes the widely used Singular Spectrum Analysis (SSA) in the time series literature, and allows us to establish a rigorous link between time series analysis and matrix estimation. The key to establishing this link is constructing a Page matrix with non-overlapping entries rather than a Hankel matrix as is commonly done in the literature (e.g., SSA). This particular matrix structure allows us to provide finite sample analysis for imputation and prediction, and prove the asymptotic consistency of our method. Another salient feature of our algorithm is that it is model agnostic with respect to both the underlying time dynamics and the noise distribution in the observations. The noise agnostic property of our approach allows us to recover the latent states when only given access to noisy and partial observations, as in a Hidden Markov Model; e.g., recovering the time-varying parameter of a Poisson process without knowing that the underlying process is Poisson. Furthermore, since our forecasting algorithm requires regression with noisy features, our approach suggests a matrix estimation based method---coupled with a novel, non-standard matrix estimation error metric---to solve the errors-in-variables regression problem, which could be of interest in its own right. Through synthetic and real-world datasets, we demonstrate that our algorithm outperforms standard software packages (including R libraries) in the presence of missing data as well as high levels of noise.
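The Page-versus-Hankel distinction is easy to make concrete (an illustrative sketch only; the paper's algorithm additionally performs matrix estimation and regression on the Page matrix):

```python
def page_matrix(series, L):
    """Stack non-overlapping length-L windows of the series as columns.

    Unlike a Hankel matrix, no entry is repeated, which is what makes
    the entries of the matrix independent (given independent noise)
    and enables finite-sample analysis.  Trailing elements that do not
    fill a complete window are dropped.
    """
    n = (len(series) // L) * L
    cols = [series[i:i + L] for i in range(0, n, L)]
    # transpose so each window becomes a column
    return [[col[r] for col in cols] for r in range(L)]


def hankel_matrix(series, L):
    """Overlapping windows: every entry reappears along an anti-diagonal,
    the construction traditionally used in SSA."""
    return [[series[r + c] for c in range(len(series) - L + 1)]
            for r in range(L)]
```

For a series of length T, the Page matrix is L x (T // L) with each observation used once, while the Hankel matrix is L x (T - L + 1) with heavily shared entries.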

Web browsers have become one of the most commonly used applications for desktop and mobile users. Despite recent advances in network speeds and several techniques to speed up web page loading, browsers still suffer from relatively long page load time (PLT). In particular, web applications need browsers to deliver higher performance to compete with native applications. Recent studies show that the network connection is no longer the bottleneck of browser performance; however, no subsequent analysis has been conducted to determine which parts of the browser's computation contribute to the performance overhead.

In this paper, we apply a comprehensive and quantitative what-if analysis on the web browser's page loading process. Unlike conventional profiling methods, we apply causal profiling to precisely determine the impact of each computation stage, such as HTML parsing and Layout, on PLT. For this purpose, we develop COZ+ (available at https://gitlab.com/coz-plus/coz-plus), a high-performance causal profiler capable of analyzing large software systems such as the Chromium browser. COZ+ highlights the most influential spots for further optimization. For example, it shows that optimizing JavaScript by 40% is expected to improve the Chromium page loading performance by more than 8.5% under typical network conditions.

Network traffic for delay-sensitive services has become a dominant part in the network. Proactive caching with the aid of predictive information has been proposed as a promising method to enhance delay performance. In this paper, we analytically investigate the problem of how to efficiently utilize uncertain predictive information to design proactive caching strategies with provably good access-delay characteristics. We first derive an upper bound for the average amount of proactive service per request that the system can support. We then analyze the behavior of a family of threshold-based proactive strategies using an innovative Markov chain analysis, and show that the average amount of proactive service per request can be maximized by properly selecting the threshold. Finally, we propose the UNIFORM strategy, which is the threshold-based strategy with the optimal threshold. Surprisingly, we show that it outperforms the commonly used Earliest-Deadline-First (EDF) type proactive strategies in terms of delay. We perform extensive numerical experiments to demonstrate the influence of thresholds on delay performance, and explicitly compare performance of the EDF strategy and the UNIFORM strategy.

Networking on wearable devices such as smartwatches is becoming increasingly important as fueled by new hardware, OS support, and applications. In this work, we conduct a first in-depth investigation of the networking performance of Wear OS, one of the most popular OSes for wearables. Through carefully designed controlled experiments conducted in a cross-device, cross-protocol, and cross-layer manner, we identify serious performance issues of Wear OS regarding key aspects that distinguish wearable networking from smartphone networking: Bluetooth (BT) performance, smartphone proxying, network interface selection, and BT-WiFi handover. We pinpoint their root causes and quantify their impacts on network performance and application QoE. We further propose practical suggestions to improve wearable networking performance.

It has become increasingly difficult to understand the complex interaction between modern applications and main memory, composed of Dynamic Random Access Memory (DRAM) chips. Manufacturers and researchers are developing many different types of DRAM, with each DRAM type catering to different needs (e.g., high throughput, low power, high memory density). At the same time, the memory access patterns of prevalent and emerging applications are rapidly diverging, as these applications manipulate larger data sets in very different ways. As a result, the combined DRAM-workload behavior is often difficult to intuitively determine today, which can hinder memory optimizations in both hardware and software.

In this work, we identify important families of workloads, as well as prevalent types of DRAM chips, and rigorously analyze the combined DRAM-workload behavior. To this end, we perform a comprehensive experimental study of the interaction between nine different DRAM types (DDR3/4, LPDDR3/4, GDDR5, Wide I/O, Wide I/O 2, HBM, HMC) and 115 modern applications and multiprogrammed workloads from six diverse application families (desktop/scientific, server/cloud, multimedia acceleration, network acceleration, GPGPU, OS routines). We draw 12 key observations from our characterization, enabled in part by our development of new metrics that quantify the effect of memory access patterns on hardware utilization. We highlight our five most significant observations here:

(1) Despite having 50% higher memory bandwidth than DDR3, the newer DDR4 rarely outperforms DDR3 on the applications we evaluate, as DDR4's access latency is 11-14% higher.

(2) The high-bandwidth HMC does not outperform DDR3 for most single-thread workloads and many multithreaded applications. This is because HMC's design trade-offs (e.g., a row width that is 97% smaller than DDR3) fundamentally limit opportunities for exploiting spatial locality. For example, single-thread desktop and scientific applications actually perform 5.8% worse with HMC than with DDR3, on average, even though HMC offers 87.4% more memory bandwidth. HMC provides significant performance improvements over other DRAM types in cases where application spatial locality is low (or is destroyed), such as highly-memory-intensive multiprogrammed workloads.

(3) While low-power DRAM types typically perform worse than standard-power DRAM for most memory-intensive applications, some low-power DRAM types perform well when bandwidth demand is very high. For example, on average, LPDDR4 performs only 7.0% worse than DDR3 for our multiprogrammed desktop workloads, while consuming 68.2% less energy, and Wide I/O 2 performs 2.3% better than DDR3 for multimedia acceleration.

(4) The best DRAM for a heterogeneous system depends heavily on the predominant function(s) performed by the system. We study three types of applications for heterogeneous systems. First, multimedia acceleration benefits most from high-throughput memories that exploit a high amount of spatial locality, running up to 21.6% faster with GDDR5 and 14.7% faster with HBM than DDR3, but only 5.0% faster with HMC. Second, a network accelerator's memory requests are highly bursty and do not exhibit significant spatial locality, and are thus a good fit for the high bank-level parallelism of HMC (88.4% faster on average over DDR3). Third, GPGPU applications exhibit a wide range of memory intensity, but memory-intensive GPGPU applications typically also take advantage of spatial locality due to memory coalescing, and perform more effectively with HBM (26.9% higher on average over DDR3) and GDDR5 (39.7%) than with DDR3 or HMC.

(5) Several common OS routines (e.g., file I/O, process forking) exhibit extremely high spatial locality, and do not benefit from high amounts of bank-level parallelism. As a result, they perform better with memories such as DDR3 and GDDR5, which have lower access latencies than the other memory types that we study. Since OS routines are used across most computer systems in a widespread manner, we believe DRAM designers must provide low access latency, instead of following the current trend of increasing latency in order to deliver greater throughput.

For more information on our extensive experimental characterization, we refer the reader to the full version of our paper. We hope that the trends we identify can drive optimizations in both hardware and software design. To aid further study, we open-source our extensively-modified simulators, as well as MemBen, a benchmark suite containing our applications.

Markov Chain Monte Carlo (MCMC) has been the de facto technique for sampling and inference of large graphs such as online social networks. At the heart of MCMC lies the ability to construct an ergodic Markov chain that attains any given stationary distribution $\boldsymbol{\pi}$, often in the form of random walks or crawling agents on the graph. Most of the works around MCMC, however, presume that the graph is undirected or has reciprocal edges, and become inapplicable when the graph is directed and non-reciprocal. Here we develop a similar framework for directed graphs called Non-Markovian Monte Carlo (NMMC) by establishing a mapping to convert $\boldsymbol{\pi}$ into the quasi-stationary distribution of a carefully constructed transient Markov chain on an extended state space. As applications, we demonstrate how to achieve any given distribution $\boldsymbol{\pi}$ on a directed graph and estimate the eigenvector centrality using a set of non-Markovian, history-dependent random walks on the same graph in a distributed manner. We also provide numerical results on various real-world directed graphs to confirm our theoretical findings, and present several practical enhancements to make our NMMC method ready for practical use in most directed graphs. To the best of our knowledge, the proposed NMMC framework for directed graphs is the first of its kind, lifting the limitations imposed by the standard MCMC methods for undirected graphs.
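The standard MCMC machinery that the abstract contrasts with can be illustrated by a Metropolis-Hastings random walk on an undirected graph, which attains a prescribed target distribution $\boldsymbol{\pi}$ via a degree-corrected acceptance step. This is the classical construction that breaks down on non-reciprocal directed graphs; the sketch below (function and argument names are illustrative, not from the paper) assumes a connected undirected graph given as an adjacency dict:

```python
import random

def mh_random_walk(neighbors, pi, start, steps):
    """Metropolis-Hastings random walk on an undirected graph.

    neighbors: dict mapping node -> list of neighbors
    pi: dict mapping node -> target stationary probability
    Returns the empirical visit frequencies, which converge to pi.
    """
    counts = {v: 0 for v in neighbors}
    x = start
    for _ in range(steps):
        y = random.choice(neighbors[x])  # propose a uniform neighbor
        # Acceptance ratio corrects for both the target mass and the
        # asymmetry of the uniform-neighbor proposal (node degrees).
        a = min(1.0, (pi[y] * len(neighbors[x])) / (pi[x] * len(neighbors[y])))
        if random.random() < a:
            x = y
        counts[x] += 1
    return {v: c / steps for v, c in counts.items()}
```

Note that the acceptance ratio requires the reverse proposal probability $1/\deg(y)$, which is exactly what is unavailable when the edge $(y, x)$ may not exist in a directed graph.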

Graph alignment in two correlated random graphs refers to the task of identifying the correspondence between vertex sets of the graphs. Recent results have characterized the exact information-theoretic threshold for graph alignment in correlated Erdős-Rényi graphs. However, very little is known about the existence of efficient algorithms to achieve graph alignment without seeds.

In this work, we identify a region in which a straightforward $\mathcal{O}(n^{11/5} \log n)$-time canonical labeling algorithm, initially introduced in the context of graph isomorphism, succeeds in aligning correlated Erdős-Rényi graphs. The algorithm has two steps. In the first step, all vertices are labeled by their degrees and a trivial minimum distance alignment (i.e., sorting vertices according to their degrees) matches a fixed number of highest degree vertices in the two graphs. Having identified this subset of vertices, the remaining vertices are matched using an alignment algorithm for bipartite graphs. Finally, we show that the implementation of a variant of this algorithm allows for the efficient alignment of large graphs under limited noise.
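The two steps above can be sketched as follows, under strong simplifying assumptions: a noiseless permuted copy of the graph and collision-free signatures. The signature used here (a vertex's degree plus its adjacency pattern to the matched seed set) is an illustrative stand-in for the paper's bipartite alignment step, and `canonical_align` is a hypothetical name:

```python
def canonical_align(adj_a, adj_b, k):
    """Two-step canonical labeling alignment (illustrative sketch).

    Step 1: rank vertices by degree and match the k highest-degree
    vertices of the two graphs by rank.
    Step 2: match each remaining vertex by the signature formed by
    its degree and its adjacency to the matched seed vertices.
    """
    rank_a = sorted(adj_a, key=lambda v: len(adj_a[v]), reverse=True)
    rank_b = sorted(adj_b, key=lambda v: len(adj_b[v]), reverse=True)
    seeds_a, seeds_b = rank_a[:k], rank_b[:k]
    match = dict(zip(seeds_a, seeds_b))

    # Signature of a B-vertex: (degree, seed vertices it touches).
    sig_b = {
        (len(adj_b[v]), frozenset(s for s in seeds_b if s in adj_b[v])): v
        for v in adj_b if v not in match.values()
    }
    for v in adj_a:
        if v in match:
            continue
        s = (len(adj_a[v]),
             frozenset(match[u] for u in seeds_a if u in adj_a[v]))
        if s in sig_b:  # left unmatched if the signature is absent
            match[v] = sig_b[s]
    return match
```

In the noisy (correlated) setting, exact signature equality would be replaced by a minimum-distance matching, which is where the bipartite alignment step comes in.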

We consider the problem of estimating from sample paths the absolute spectral gap $1 - \lambda_\star$ of a reversible, irreducible and aperiodic Markov chain $(X_t)_{t \in \mathbb{N}}$ over a finite state space $\Omega$. We propose the UCPI (Upper Confidence Power Iteration) algorithm for this problem, a low-complexity algorithm which estimates the spectral gap in time $\mathcal{O}(n)$ and memory space $\mathcal{O}((\ln n)^2)$ given $n$ samples. This is in stark contrast with most known methods, which require at least memory space $\mathcal{O}(|\Omega|)$, so that they cannot be applied to large state spaces. Furthermore, UCPI is amenable to parallel implementation.
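For contrast, a naive plug-in estimator of the absolute spectral gap stores the full empirical transition matrix and therefore needs $\mathcal{O}(|\Omega|^2)$ memory, which is precisely the cost that rules such methods out on large state spaces. The sketch below is this illustrative baseline, not the UCPI algorithm from the paper:

```python
import numpy as np

def naive_spectral_gap(path, n_states):
    """Plug-in estimate of the absolute spectral gap 1 - lambda_star.

    Illustrative baseline only: it materializes the n_states x n_states
    empirical transition matrix, so its memory cost grows quadratically
    in |Omega| -- the bottleneck that UCPI is designed to avoid.
    """
    counts = np.zeros((n_states, n_states))
    for x, y in zip(path[:-1], path[1:]):
        counts[x, y] += 1
    P = counts / counts.sum(axis=1, keepdims=True)
    eig = np.abs(np.linalg.eigvals(P))
    # lambda_star: largest modulus among eigenvalues other than the
    # Perron eigenvalue 1.
    lam = sorted(eig, reverse=True)[1]
    return 1.0 - lam
```

On a two-state chain with transition probabilities $a$ and $b$, the eigenvalues are $1$ and $1 - a - b$, so the absolute spectral gap is $a + b$ (for $a + b \le 1$), which the estimator recovers from a long enough sample path.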