Achieving Energetic Superiority Through System-Level Quantum Circuit Simulation
Abstract.
Quantum computational supremacy promises both rapid computation and high energy efficiency. Despite recent advances in classical algorithms aimed at refuting the milestone claim of Google's Sycamore, challenges remain in generating uncorrelated samples of random quantum circuits.
In this paper, we present a large-scale system technology that leverages optimization at the global, node, and device levels to achieve unprecedented scalability for tensor networks. Our techniques accommodate tensor networks requiring tens of terabytes of memory, far beyond the capacity of a single node, scaling up to 2,304 GPUs with a peak half-precision computing power of 561 PFLOPS. Notably, we achieved a time-to-solution of 14.22 seconds with an energy consumption of 2.39 kWh at a fidelity of 0.002, and our most remarkable result is a time-to-solution of 17.18 seconds with an energy consumption of only 0.29 kWh at an XEB of 0.002 after post-processing, outperforming Google's quantum processor Sycamore in both speed and energy efficiency, which recorded 600 seconds and 4.3 kWh, respectively.
1. OVERVIEW: A New Era of Speed and Efficiency in Circuit Modeling
In the field of quantum computing, Google's claim of "quantum supremacy" with its Sycamore quantum processor marks a crucial advancement (Quantum_supremacy, ). Sycamore generated 3 (1) million uncorrelated samples within 600 (200) seconds, a task claimed to require approximately 10,000 years on the world's most powerful supercomputer. This achievement not only propels advancements in quantum computing (jiuzhang1, ; jiuzhang2, ; jiuzhang3, ; zuchongzhi56qubit, ; zuchongzhi60qubit, ), but also serves as a significant catalyst for progress in classical quantum circuit simulation.

The overarching challenge in simulating large-scale quantum circuits is their exponential time complexity: the computational cost of quantum circuit simulation increases exponentially with the size of the circuit (bertels2021quantum, ; zlokapa2023boundaries, ; villalonga2020establishing, ; haner20175, ; wu2019full, ). To tackle this challenge, a myriad of methods (Alibaba_19days, ; Sunway_304s, ; liu2022validating, ; guo2019general, ; li2019quantum, ), delineated in Fig. 1, have focused on breaking down this large-scale simulation into smaller sub-networks that can be run on a classical supercomputer, effectively reducing memory usage from petabytes (PB) to terabytes (TB) or even gigabytes (GB). However, the samples generated in (Sunway_304s, ) exhibit significant correlations among them and are thus not considered a faithful simulation of the Sycamore quantum processor. This issue has been addressed in subsequent tensor-network research (512GPUs_15h, ; 60GPUs_5days, ). To reduce the computational complexity, (leapfrogging, ; post-process, ) developed a post-processing (aka post-selection) algorithm that selects the bitstrings with the top probabilities from groups of candidate bitstrings, thereby enhancing the cross-entropy benchmark (XEB) value. The computational complexity of calculating the probabilities of all samples within any correlated subspace (bitstrings that share some bits) is remarkably low. Post-processing means computing 3 million independent correlated subspaces, each containing thousands of samples, and subsequently selecting the sample with the highest probability from each correlated subspace. This method not only generates uncorrelated samples but also boosts the XEB value by an order of magnitude.
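To make the post-selection idea concrete, the following is a minimal NumPy sketch with toy sizes and a synthetic heavy-tailed probability distribution (not the actual 3-million-subspace workload): within each correlated subspace, only the sample with the highest computed probability is kept.

```python
import numpy as np

# Toy sketch of post-selection: within each correlated subspace (bitstrings
# sharing some bits) we keep only the sample with the highest probability.
rng = np.random.default_rng(1)
n_subspaces, per_subspace = 1000, 64               # toy sizes for illustration
probs = rng.exponential(size=(n_subspaces, per_subspace))  # synthetic weights
probs /= probs.sum()                               # normalize to a distribution

winners = probs.argmax(axis=1)                     # one winner per subspace
selected = probs[np.arange(n_subspaces), winners]

# Selected samples are mutually uncorrelated (one per subspace) and carry
# systematically higher probability than the subspace average, which is what
# boosts the XEB value.
assert (selected >= probs.mean(axis=1)).all()
```

The subspace structure is what keeps the cost low: all samples in a subspace share most of their contraction work, so scoring thousands of candidates per subspace is far cheaper than scoring the same number of independent bitstrings.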
Although tensor networks augmented with slicing techniques have demonstrated their advantages in large-scale simulations, the decomposition of quantum circuits into sub-networks typically results in explosive growth of the computational cost, owing to the heavy overhead of redundant calculations (Lifetime-Based, ; brandhofer2023optimal, ; leapfrogging, ).
As shown in Fig. 2, the computational complexity of the optimal contraction path is approximately inversely proportional to the maximum memory size. This motivates us to explore minimizing the total energy cost of the subtasks through a large-scale distributed-memory system. However, this approach poses a predicament: when memory usage exceeds the capacity of a single node, inter-node communication becomes very costly. The inter-node data exchange not only incurs latency but also severely hinders overall system performance and energy efficiency.

Facing the aforementioned challenges, we intend to surpass the capabilities of present-day quantum processors, such as Sycamore, in terms of both speed and power consumption. Our major goals are twofold:
• Firstly, achieving an order-of-magnitude reduction in computation time marks a major breakthrough in the landscape of quantum computing. It transcends mere acceleration of quantum circuit simulation; rather, it represents a fundamental shift in which classical computers can not only keep up with but also outperform quantum computers in a computational task such as random circuit sampling. Leading-edge supercomputers not only provide a fast and scalable simulator for increasingly intricate quantum circuits, but may also catalyze development in wide-ranging fields, including cryptography and pharmacological research. This breakthrough could play a pivotal role in unveiling the mysteries of the once esoteric quantum realm and pushing the frontiers of computational capability.
• Secondly, energy consumption is essential in the realm of quantum computing. Some estimates suggest that quantum technology could reduce energy usage by a factor of 100 to 1000 by processing complex computations much more efficiently. Nonetheless, the total energy efficiency of quantum computing remains an open question, and low-energy solutions require further exploration. The quest for energy superiority in circuit simulation, once the sole province of quantum processors, can demonstrate the expanding capability of supercomputers to solve intricate problems without relying on increased energy consumption. Furthermore, reducing energy consumption is in concert with global commitments to sustainability and environmental stewardship, fostering the creation of green computing frameworks. In conclusion, an energy-efficient quantum circuit simulator is crucial for ongoing cost-effectiveness analysis of quantum computing and for driving sustainable development in real-world challenges.
Given the above considerations, we propose the following system-level techniques encompassing algorithmic efficiency, parallel architecture, and precision control: (1) a three-level scheme tailored for large-scale contraction networks to fully leverage distributed memory and boost energy efficiency; (2) a hybrid communication strategy that maximizes intra-node bandwidth utilization and alleviates inter-node data transfer, together with low-precision quantization to reduce the communicated data volume fourfold; (3) an efficient einsum extension for complex-half data, halving memory requirements and leveraging high-speed FP16 tensor-core computation. Furthermore, we adopt additional optimizations, such as an optimized padding scheme that avoids heavy copy operations on large tensors and a recomputation technique that halves the number of nodes required per 4T sub-task, leading to more efficient computing.
Our three-level parallelism strategy enables scalability in both the size of the tensor network and the number of GPUs utilized for executing parallel, independent sub-tasks. In our experiments, we tested two sizes of tensor networks, 4TB and 32TB (quantified in the complex-float format), with and without the post-processing technique proposed in (leapfrogging, ; post-process, ). This demonstrated close-to-linear scaling using up to 2,304 A100 GPUs.
We achieved a time-to-solution of 14.22 seconds with an energy consumption of 2.39 kWh at a fidelity of 0.002. Our top-tier solution contracts a 32T tensor network (green hollow pentagram in Fig. 2) with the post-processing technique, which cuts the time-to-solution for sampling bitstrings to just 17.18 seconds. This represents a significant reduction compared to Sycamore's 600 seconds. Additionally, it consumes a mere 0.29 kWh of energy, markedly lower than Sycamore's energy expenditure of 4.3 kWh (Quantum_supremacy, ). To the best of our knowledge, this is the first demonstration of a clear advantage over Sycamore in terms of energy consumption.
2. Current state of the art
2.1. Quantum circuits
Quantum circuits are constructed from qubits interconnected through a sequence of quantum gates. While classical circuits utilize high and low voltage levels to represent binary states, quantum circuits leverage quantum superposition. The interactions and manipulations of quantum states are dictated by unitary operations known as quantum gates. To acquire information embedded within quantum states, measurements are undertaken. The outcome of a measurement is a collapsed 0–1 bitstring in the chosen measurement basis (pan2023efficient, ). It is important to note that measurements annihilate the superposition of quantum states, necessitating the repetition of the entire experiment to gather more insights into the quantum state in question. Here, we provide a visual illustration of Google’s Sycamore (Quantum_supremacy, ) quantum circuit to clearly showcase the circuit architecture along with quantum gates.

Fig. 3 shows a portion of a 5-qubit quantum circuit example, in which each random quantum circuit (RQC) consists of a series of full cycles followed by a half cycle ending with the measurement of all qubits. Each full cycle involves two steps: first, a single-qubit gate is applied to each qubit; next, two-qubit gates are applied to pairs of qubits, with different qubit pairs being allowed to interact in different cycles. The RQC sequence is repeated in subsequent cycles. The half cycle preceding the measurement exclusively comprises single-qubit gates.
In Google's Sycamore processor, three distinct single-qubit gates are used, each representing a π/2-rotation about an axis on the equator of the Bloch sphere. Disregarding the global phase, these gates can be represented as √X, √Y, and √W (with W = (X + Y)/√2), while the two-qubit gates are fSim-type gates whose placement is determined by the cycle index, and the parameters θ and φ of the fSim matrix are determined by the qubit pairing.
2.2. RQC simulation methods
RQC simulations on classical computers generally fall into two main categories (Sunway_304s, ). A traditional approach to simulating the evolution of the quantum state is the state vector method (de2019massively, ), which tracks the evolution of quantum states using a state vector, with the time evolution described by the Schrödinger equation. Nevertheless, this necessitates exponentially large memory space (2^n amplitudes) for an n-qubit system (vidal2003efficient, ), thereby limiting RQC simulations with large numbers of qubits and high entanglement (state_vector, ).
An alternative method is to use tensor networks, where an n-qubit quantum state can be formulated as a rank-n tensor with 2^n complex-number entries. Single-qubit and two-qubit quantum gate operations can be represented by rank-2 and rank-4 tensors, respectively. Consequently, an RQC can be converted into a tensor network, which can then be contracted to yield specific final-state bitstring amplitudes (villalonga2019flexible, ; chen2018classical, ).
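As a minimal illustration of this correspondence (a toy 2-qubit circuit, not a Sycamore RQC), a circuit becomes a single index contraction over gate tensors:

```python
import numpy as np

# Toy 2-qubit circuit: H on each qubit, then CZ; amplitudes via contraction.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)      # rank-2 gate
CZ = np.diag([1, 1, 1, -1]).astype(complex).reshape(2, 2, 2, 2)  # rank-4 gate
psi = np.zeros((2, 2), dtype=complex)
psi[0, 0] = 1                                                    # |00> state

# Contract the whole network in one einsum call: gates attach to state indices.
out = np.einsum("ab,ia,jb,klij->kl", psi, H, H, CZ)
# All four bitstrings have amplitude of magnitude 1/2; |11> picks up CZ's sign.
assert np.isclose(out[1, 1], -0.5)
```

For real RQCs the network has far too many indices to contract naively, which is why finding a good contraction order is the central problem discussed below.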
Over the past four years, the simulation of Sycamore’s RQC has gradually converged into tensor network algorithms. Significant effort has been devoted to two key points: finding the optimal contraction order and streamlining the contraction processes to achieve high-performance calculations.
A powerful parallel algorithm for the contraction of tensor networks is stem optimization (Alibaba_19days, ), which involves a path of contractions that dominates the overall computational cost. The tensor-based algorithm dynamically decomposes the computational task into numerous smaller subtasks that can be executed in parallel on a cluster. Although (Alibaba_19days, ) demonstrated improvements over conventional methods, these subtasks still involved intense computations within a single node. (Sunway_304s, ) introduced an optimized slicing strategy that reduced the space complexity of sliced tensors and enabled embarrassingly parallel execution of subtasks. This strategy, combined with better contraction paths and custom optimizations, allowed for more efficient computing performance than previous methods, such as those provided by the CoTenGra software (cotengra, ).
Pan et al. (512GPUs_15h, ) considered quantum circuits as a 3-dimensional tensor network, from which they broke edges to form sub-networks. This process can be visualized as drilling holes in the 3D graphical representation of the network. They introduced a method for sparse-state tensor contraction, which allows the amplitudes of many uncorrelated bitstrings to be calculated in a single step. This approach is notably more efficient for producing numerous uncorrelated samples than previous techniques.
Furthermore, (leapfrogging, ) presented a post-processing/post-selection technique in which they calculated the probability distribution for each bitstring accurately and obtained the top-k bitstrings with the largest output probabilities. Through this approach, it was found that merely 0.03% of the total subtasks (i.e., contracting 0.03% of the sub-networks) needed to be executed to reach an XEB value of 0.002. As a result, they significantly improved the cross-entropy benchmarking (XEB) values and achieved a groundbreaking faster solution, which has motivated the methodology of this work.
2.3. Classical versus quantum: tensor-network-based experiments
Numerous simulations of random quantum circuits have been specifically designed to bridge the quantum-classical simulation divide. Based on tensor-network algorithms, Zuchongzhi 2.0 (zuchongzhi56qubit, ) and Zuchongzhi 2.1 (zuchongzhi60qubit, ) reached simulation scales of 56 qubits at 20 cycles and 60 qubits at 24 cycles, respectively. Generating one perfect sample from Zuchongzhi 2.1 was estimated to require a substantially larger number of floating-point operations than for Sycamore. The random circuit sampling task was completed by Zuchongzhi 2.1 in 4.2 hours at its reported XEB fidelity.
Alibaba presented a tensor-network-based classical simulation algorithm (Alibaba_19days, ). Through runtime testing of subtasks, it was estimated that the Sycamore task (a 53-qubit, 20-cycle circuit with a fidelity of 0.2%) could be accomplished in 19.3 days using the Summit cluster. Xin Liu et al. (liu2021redefining, ) employed CoTenGra to determine a near-optimal tensor contraction order for computational purposes. Consequently, they produced 2,000 perfect samples (or an equivalent of 1 million samples with 0.2% fidelity) within a span of 6.4 days on the latest generation of Sunway's supercomputer. Yong Liu et al. developed an RQC simulator on the advanced Sunway supercomputer. They introduced an optimized path strategy based on tensor-network simulation to balance the memory requirements and the number of concurrent computations, achieving the remarkable feat of sampling 1 million correlated samples of the Google Sycamore RQC task in just 304 seconds (Sunway_304s, ).
Targeting Google's Sycamore RQC problem, Pan et al. proposed the big-head approach (60GPUs_5days, ): using 60 GPUs for 5 days, they generated correlated samples with an XEB of 0.739 and passed the XEB test. Another tensor-network method (512GPUs_15h, ) computed uncorrelated bitstrings at the target fidelity in 15 hours on a computational cluster with 512 GPUs. Pan et al. also proposed an optimization strategy that transfers the most time-consuming component of the sparse-state tensor-network simulation from sparse Einstein summation (einsum) to matrix-multiply operations (Alibaba_19days, ). This optimization was applied to the simulation of the 53-qubit, 20-cycle Sycamore circuits, using 2,819 A100 GPU hours to verify three million sampled bitstrings.
A recent work (leapfrogging, ) developed a post-processing algorithm aimed at enhancing XEB values. Implemented with 1,432 NVIDIA A100 GPUs, it yielded 3 million uncorrelated samples with an XEB value of 0.002 in 86.4 seconds, while consuming 13.7 kWh of electricity. To further trade between spatial and temporal complexity, we extend (leapfrogging, ) by examining the contraction path's time complexity under predetermined memory limits. While Fig. 2 (b) presents the frequency distribution of the time complexities of all contraction paths under memory sizes from 64GB to 2PB, each point in Fig. 2 (a) illustrates the optimal contraction path with the least time complexity within a certain memory constraint. As the available memory increases, the time complexity of the optimal contraction path initially decreases rapidly and then gradually diminishes, eventually converging beyond a 32TB memory limit. This observation highlights the potential of harnessing more memory resources for faster computing.
The RQC simulations described above have made significant progress in estimated or actual computation time, and the gap with Sycamore is no longer substantial. Nonetheless, even with the state-of-the-art approach, there remains an order-of-magnitude energy-efficiency gap between classical and quantum computers on the RQC sampling task. As (leapfrogging, ) has solidified the outstanding efficiency of tensor-network calculations, the subsequent sections generate uncorrelated samples based on the optimal solutions for the 4TB and 32TB tensor networks (marked as hollow pentagrams in red and blue in Fig. 2 (a)).
3. Innovations
The optimal contraction path searching and edge splitting algorithms for the tensor network are based on (512GPUs_15h, ). In these methodologies, we employ the technique of drilling holes, or breaking edges, in the original 3D tensor network to decrease the complexity of a sub-task. As explained in Subsection 2.3, leveraging the spatial capacity of tensor networks, for example the 4TB and 32TB tensor networks, can greatly reduce the computational complexity. However, scaling up the tensor network introduces communication overhead, posing a new bottleneck in inter-node communication within the GPU cluster.
To address this challenge, this section proposes solutions that mitigate inter-node communication issues and harness the potential of the GPU cluster through a three-level parallel scheme embedded with a hybrid communication algorithm. Furthermore, we economize computing resources by converting complex single-precision calculations into complex half-precision calculations. Additionally, we present tailored remedies for specific scenarios encountered in our application domain.
3.1. Three-level parallelization scheme

We categorize our architecture into three hierarchical levels as depicted in Fig. 4 (a): global level, multi-node level, and device level.
Consistent with the approaches described in (512GPUs_15h, ), after identifying a near-optimal path, we initially dissect the contraction of the overall tensor network into independent sub-networks (e.g., 4T and 32T), which can be executed on separate multi-node groups, labeled as the multi-node level. Each sub-task, corresponding to the contraction of an independently sliced tensor, is allocated to the multi-node level (indicated by the solid orange rectangle in Fig. 4). Herein, each multi-node-level task encompasses the entire sub-network contraction process and is distributed across multiple nodes interconnected via an InfiniBand (IB) network.
In the context of the tensor network, the term "stem path" refers to a sequence of expensive nodes that dominate the overall computation and memory cost (Alibaba_19days, ), and the "stem tensor" is the tensor associated with the stem path.
Given a rank-n stem tensor, each mode of which has dimension 2, the leading ranks (also referred to as modes) signify the division of the tensor into segments. After slicing, each node holds a lower-rank remainder of the stem tensor.
The final level of parallelization, among disparate devices and exemplified by the dashed green rectangle in Fig. 4, involves the specific matrix multiplication and index permutation operations executed on each GPU, interconnected through high-bandwidth NVLink. Here, the stem tensor is further segmented into pieces of a size that fits into a single device by trimming subsequent modes.
Before launching the final network, we introduce a pre-processing step involving a hybrid communication algorithm that determines the inter and intra modes according to the bandwidth of the different storage levels.
As shown in Fig. 4, in our classical computational setup, GPUs within a node are connected via NVLink, which has a bandwidth of 300 GB/s. Nodes are connected through InfiniBand, which in our case operates at a bandwidth of 100 GB/s. These IB links are shared by 8 GPUs. As a result, inter-node communication is one order of magnitude slower than intra-node communication.
Our approach divides the workload between intra-node and inter-node communication, taking advantage of the higher bandwidth provided by NVLink and thereby optimizing overall performance. The hybrid communication algorithm that combines intra-node and inter-node communication at each step is shown in Algo. 1.
Fig. 4 (b) illustrates an example of the rearrangement pattern in which the numbers of inter and intra modes are both equal to 1: the leading mode is the inter mode, while the following mode is the intra mode. During a single tensor contraction, no inter-node communication is required if the leading (inter) modes remain uncontracted. However, if any of the following (intra) modes are contracted, we rearrange the tensor by swapping them with subsequent modes and redistribute the tensor based on the new modes. When some of the leading (inter) modes are contracted, we rearrange the tensor by swapping them with the next modes and redistribute it accordingly.
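The mode-swap rearrangement can be illustrated with a small NumPy sketch (toy shapes and an assumed distribution scheme; in the real system the re-slicing is realized by an all-to-all exchange across nodes):

```python
import numpy as np

# A rank-n tensor with dimension-2 modes, distributed over 2**k nodes by its
# first k (inter) modes: node r holds slice r of the leading axes.
n, k = 6, 2
T = np.arange(2 ** n, dtype=np.float64).reshape((2,) * n)
shards = T.reshape(2 ** k, -1)              # shard r is what node r stores

# To contract a leading mode, first promote two later modes (here modes 2 and
# 3) to the front; this transpose + re-slice is what the all-to-all exchange
# implements across nodes.
T_new = np.transpose(T, (2, 3, 0, 1, 4, 5))
new_shards = T_new.reshape(2 ** k, -1)

# Same data, redistributed: each node now owns different leading-mode slices.
assert new_shards.size == shards.size
assert not np.array_equal(new_shards, shards)
```

When only intra modes are swapped, the same transpose stays within a node and rides on NVLink bandwidth, which is why the algorithm prefers it whenever possible.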
3.2. Customized Low-Precision Communication
When the leading (inter) modes are contracted, we need to perform an all-to-all communication between multiple nodes, as shown in Fig. 4, which is a dominant factor in both time and energy usage. For example, in a 4T tensor network, inter-node communication accounts for up to 60.0% of the time and around 35% of the energy consumption. Consequently, inter-node communication becomes the bottleneck in both time and energy. Here we apply low-precision quantization techniques to alleviate the cost of inter-node communication.
In low-precision quantization, utilizing a single scale and zero offset for the whole tensor could result in substantial precision losses. Therefore, quantizers divide the tensor into multiple groups and apply quantization to each group, each of which is referred to as a group tensor. Following classic tensor quantization methods (Vectorquantization, ), we apply the general formulation for the quantization operator $Q$ applied to the $i$-th group $T_i$ of tensor $T$:

(1)   $Q(T_i) = \mathrm{round}\!\left( \frac{T_i^{\,\mathrm{exp}}}{s} + z \right), \qquad s = \frac{\max(T_i) - \min(T_i)}{q_{\max} - q_{\min}},$

where $s$ is the scale factor, $q_{\max}$ and $q_{\min}$ are the bounds of the quantized value range we map to, and $\max(T_i)$ and $\min(T_i)$ are the maximum and minimum values of the group tensor $T_i$. The zero-point is $z = q_{\min} - \min(T_i)/s$, and exp is an optional parameter that performs an exponential non-linear operation.
Based on the above equation, we provide three types of quantization: float2half, float2int8, and float2int4 (see Table 1), which displays the optimal quantization parameters for each type. For half and int8 quantization, we compute a global scale and zero-point factor for the entire tensor. In contrast, int4 quantization requires more granular quantization within each group of the tensor rather than over the entire tensor, which significantly minimizes fidelity loss (GDRQ, ). As different groups have different parameters, smaller group sizes yield better fidelity due to tailored scales and zero-points, but incur additional communication overhead. Additionally, rounding is incorporated when performing int quantization.
Type | Range | Exp | Group | Round
---|---|---|---|---
float | – | – | – | –
float2half | – | 1 | entire tensor | false
float2int8 | – | 0.2 | entire tensor | true
float2int4 | – | 1 | group tensor | true
We further enhance performance by crafting custom kernels for all the quantization types above. As we deal with complex-valued data, we optimize the kernels through vectorized memory access and fine-tuning to achieve maximum bandwidth utilization and reduce latency. Finally, by adopting the int4 quantization kernel, communication time is decreased by over 85% with minimal fidelity loss (compared with float), proving advantageous in various cases.
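A minimal NumPy sketch of the per-group affine scheme behind float2int4 (hypothetical group size and helper names; the production version is a fused CUDA kernel operating on complex data viewed as interleaved floats):

```python
import numpy as np

def quantize_int4(x, group_size=128):
    """Per-group affine quantization to the 16 int4 levels in [-8, 7]."""
    pad = (-x.size) % group_size
    g = np.pad(x.ravel(), (0, pad)).reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                    # 15 steps span each group
    scale[scale == 0] = 1.0                     # guard constant groups
    zero = -8.0 - lo / scale                    # per-group zero-point
    q = np.clip(np.round(g / scale + zero), -8, 7).astype(np.int8)
    return q, scale, zero, pad

def dequantize_int4(q, scale, zero, pad, shape):
    g = (q.astype(np.float64) - zero) * scale
    flat = g.ravel()
    return (flat[:flat.size - pad] if pad else flat).reshape(shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)                   # stand-in for real/imag parts
q, s, z, pad = quantize_int4(x)
x_hat = dequantize_int4(q, s, z, pad, x.shape)
# Per-group rounding error is bounded by half a quantization step.
assert np.max(np.abs(x_hat - x)) <= 0.5 * s.max() + 1e-12
```

Smaller groups shrink each group's dynamic range and hence its step size, which is the fidelity/overhead trade-off noted above: the per-group scales and zero-points must also travel over the wire.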
3.3. Expanding Einsum Paradigm for Complex-Half Precision
Benefiting from the absence of uncontrollable noise, complex-half precision can be adopted in classical quantum simulation to minimize memory demand. However, complex-half extensions are not directly available. We delve into the einsum paradigm and extend it to support complex-half precision.
Assume that there are three tensors $A$, $B$, and $C$. Let $x$ and $y$ denote their free indices (referred to as modes in the context of an einsum equation) and $z = (z_1, \dots, z_r)$ the $r$ reduction indices shared by $A$ and $B$. A general einsum equation

(2)   $C = \mathrm{einsum}(A, B)$

can be expressed as a superposition as follows:

(3)   $C_{x,y} = \sum_{z_1, \dots, z_r} A_{x,z} \, B_{z,y},$

where $z_1, \dots, z_r$ are the indices to be reduced. It is crucial to recognize that the formula can only be transformed into a pure General Matrix Multiply (GEMM) operation, rather than a combination of GEMM and reduction, when the reduction indices appear contiguously and in the same order in both inputs. Moreover,

(4)   $x = \mathrm{modes}(A) \setminus z, \qquad y = \mathrm{modes}(B) \setminus z$

are the remaining indices for tensors $A$ and $B$, respectively.
To optimize space complexity and enhance calculation speed, a straightforward approach is to swap single-precision floating-point numbers for half-precision ones. However, we encountered a lack of support for contracting complex half-precision numbers in high-performance computing libraries; this deficiency persists even in well-known libraries. Some libraries, such as PyTorch, support complex half-precision calculations by splitting values into real and imaginary parts, which is inefficient due to multiple reads/writes and the handling of discontinuous data. We introduce a new solution to these challenges.
To enhance readability, we denote $A$ as the larger input, $B$ as the smaller one, and $C$ as the output. In this case, $B$'s IO cost is often negligible compared to that of $A$ and $C$, so our primary focus is on data access for $A$ and $C$.
A straightforward attempt is to simply append a mode $s \in \{\mathrm{re}, \mathrm{im}\}$, indicating whether an entry represents the real or imaginary part, to the end of each component in Equation 2:

(5)   $C_{x,y,s} = \sum_{z} A_{x,z,s} \, B_{z,y,s}.$

However, this approach is incorrect: complex multiplication mixes real and imaginary parts across the inputs, whereas in Equation 5 the mode $s$ acts as a mere batch index shared by both inputs and the output, so no cross terms are generated. This necessitates a modification of Equation 5. Since introducing a supplementary mode doubles a tensor's size in bytes, the most efficient approach is to include the extra mode only for tensor $B$, as it is smaller than tensor $A$. The final equation becomes

(6)   $C_{x,y,t} = \sum_{z,s} A_{x,z,s} \, B'_{z,y,s,t},$

where the output mode $t$ ranges over the same real/imaginary pair as $s$, and tensor $B'$ is padded from $B_{z,y,s}$ to $B'_{z,y,s,t}$ with $B'_{\mathrm{re},\mathrm{re}} = B_{\mathrm{re}}$, $B'_{\mathrm{re},\mathrm{im}} = B_{\mathrm{im}}$, $B'_{\mathrm{im},\mathrm{re}} = -B_{\mathrm{im}}$, and $B'_{\mathrm{im},\mathrm{im}} = B_{\mathrm{re}}$ along the $(s, t)$ modes.
As an example, consider a small contraction of complex tensors $A$ and $B$. Following the provided method, we view $A$ as a real tensor with a trailing real/imaginary mode, expand $B$ into the padded tensor $B'$, and apply the modified equation; the result coincides exactly with the complex-valued reference.
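Such an example can be checked numerically. Below is a small NumPy verification of the padded-mode construction with toy shapes (single precision suffices here, whereas the real kernels run in FP16 on tensor cores):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))   # A[x, z]
B = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))   # B[z, y]
ref = np.einsum("xz,zy->xy", A, B)          # complex reference result

# View A as a real tensor with a trailing real/imag mode s.
Ar = np.stack([A.real, A.imag], axis=-1)    # shape (x, z, 2)

# Pad B with two real/imag modes (s, t) so the contraction over s generates
# the cross terms of complex multiplication.
Bp = np.empty(B.shape + (2, 2))
Bp[..., 0, 0] = B.real                      # s=re contributes to t=re
Bp[..., 0, 1] = B.imag                      # s=re contributes to t=im
Bp[..., 1, 0] = -B.imag                     # s=im contributes to t=re
Bp[..., 1, 1] = B.real                      # s=im contributes to t=im

Cr = np.einsum("xzs,zyst->xyt", Ar, Bp)     # purely real-valued einsum
C = Cr[..., 0] + 1j * Cr[..., 1]
assert np.allclose(C, ref)
```

Only the smaller input doubles in size, so the byte overhead is negligible relative to the halving obtained by storing A and C in half precision.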
3.4. Optimization for special cases
We introduce several specialized techniques. They are not as general as the previously introduced methods, but they are useful for certain tasks.
3.4.1. Recomputation
In the 4T tensor network, we observe the following features: 1. there are only four steps with tensors above 1T, one of which reaches 2T; 2. during and after these four steps, there is no data communication. Based on this, we designed a recomputation technique. The main idea is that, instead of calculating an entire large tensor at once, we compute half of it at a time. Specifically, we begin at the start, just before the generation of the 1T tensor, store and compute only half of the tensors, and proceed until the end. Then we restart from the middle, read the remaining half, and calculate the second half. Finally, we concatenate the two parts. This technique reduces the number of required nodes by a factor of two and concurrently diminishes the data volume for all-to-all communication, since the number of distributed modes is reduced by 1.
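The recomputation idea amounts to slicing one dimension-2 mode and running the tail of the contraction twice, halving peak memory. A toy NumPy sketch (stand-in shapes, not the actual 1T-scale stem tensors):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 8, 8))          # stand-in for a large stem tensor
B = rng.standard_normal((8, 8))

# One-shot contraction: peak memory holds all of A at once.
full = np.einsum("sij,jk->sik", A, B)

# Two passes over halves of the sliced mode s, then concatenate: each pass
# only ever materializes half of the data, at the cost of rerunning the tail.
halves = [np.einsum("ij,jk->ik", A[s], B) for s in (0, 1)]
assert np.allclose(np.stack(halves), full)
```

The trade is favorable here precisely because no communication happens during or after the four oversized steps, so the repeated tail is pure local compute.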
3.4.2. Tensor contraction during sparse state
As described in (512GPUs_15h, ), the sparse state occurs in the final stage of tensor-network computation. It is inherently discontinuous and repetitive, requiring tensors to be copied for computation. However, due to the allocation of a double buffer, GPU memory is nearly exhausted and cannot hold large tensor calculations, so we divide the larger tensor into smaller chunks that fit into the current GPU memory, with the number of chunks determined by the remaining GPU memory capacity. We then compute each tensor chunk iteratively.

Tensor contraction involves dimension reordering and matrix multiplication (Sunway_304s, ). With the enhancement of the NVIDIA A100, we are no longer limited to contracting a single pair of tensors: by utilizing cuTensor library functions, we can perform computations on multiple pairs of tensors simultaneously. As shown in Fig. 5, we obtain multiple tensors at specified positions through the indices IndexA and IndexB to form the large tensors bigA and bigB before performing matrix multiplication. We have designed two indexing schemes to accommodate different index distribution scenarios.
Let A[a,c,d,f] and B[b,e,f] be input tensors and C[n,c,d,e] be the contraction product, all modes having dimension two except the outermost ranks a, b, and n, and let f be the common rank to be contracted. IndexA and IndexB are, respectively, the chunk index arrays of A and B, so n is the common chunk count of C. Usually, IndexA and IndexB are arrays of integers from 0 to a−1 or b−1, and gathering with them forms bigA and bigB directly. However, when there is a large amount of repeated data in the index corresponding to a high-rank input tensor, e.g., IndexA = [0, 1, 1, 1, 2], gathering in this way becomes very expensive, so we use the input tensor A directly. For the correctness of retrieving B, we then pad IndexB into a new 2-D index of size a × m (where m is the maximum repeat count of any value in IndexA), with excess positions replaced by −1; in the example above, m is 3 since 1 appears 3 times. To avoid ambiguity, we call this new index IndexB'; because B is the smaller tensor compared with A, this step does not increase the cost much. Finally, we obtain an intermediate product from which C is derived by flattening its outside ranks a and c and then extracting the valid tensors within it.
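The first (direct-gather) indexing scheme can be sketched as follows (toy shapes and hypothetical index values; cuTensor performs the batched contraction on GPU, here emulated with NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3, 5))          # A[a, c, f]
B = rng.standard_normal((6, 5))             # B[b, f]
IndexA = np.array([0, 1, 1, 2])             # note the repeated value 1
IndexB = np.array([2, 0, 5, 3])

bigA = A[IndexA]                            # gather chunks: (n, c, f)
bigB = B[IndexB]                            # gather chunks: (n, f)
C = np.einsum("ncf,nf->nc", bigA, bigB)     # one batched contraction over f

# Same result as contracting each selected (A-chunk, B-chunk) pair separately.
ref = np.stack([A[i] @ B[j] for i, j in zip(IndexA, IndexB)])
assert np.allclose(C, ref)
```

The second scheme avoids materializing bigA when IndexA is highly repetitive, paying instead the small cost of padding the B-side index as described above.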
4. Evaluation
We begin by presenting our experimental setup. Following that, we discuss the tools utilized in our performance analysis. Next, we conduct an evaluation of low-precision communication and provide usage advice for communication in RQC simulation. We proceed to assess the proposed technique step by step and identify the optimal solution for a single multi-node level subtask. Finally, we present and analyze the results of testing two large-scale tensor networks (specifically, 4T and 32T).
4.1. Experiment setup
We employ 80GB A100 GPUs, featuring peak FP16 tensor-core performance of 312 TFLOPS. The eight A100 GPUs within each node are interconnected via NVLink, providing a unidirectional bandwidth of 300 GB/s, while nodes are connected via InfiniBand with a unidirectional bandwidth of 100 GB/s. The task is implemented on the PyTorch framework (version 2.1); einsum calculation is performed using cuTensor (version 1.7.0) with CUDA toolkit version 12.0.
4.2. Measurement tools
For monitoring GPU power consumption in real time, we create a subprocess that continuously invokes the NVML library (nvml, ) through its Python interface. This subprocess runs at the device level and receives its machine rank as a parameter, so it can capture and store both the relative timestamp and the instantaneous power of each graphics card at approximately 20-millisecond intervals. Because the monitor runs in a separate process, it has almost no impact on the performance of the main process.
At the end of the tasks, we numerically integrate each sampled power trace to obtain its total energy and sum the per-GPU energies at the global level, which reflects the consumption of the entire operation. In Table 2, we present the measured power under three distinct scenarios.
Scenario | Power per A100 GPU
---|---
Idle | 60 W
Communication | 90–135 W
Computation | 220–450 W
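The energy accounting described above can be sketched as follows, assuming each monitor yields (timestamp in seconds, power in watts) samples; the NVML polling itself is omitted since it requires a GPU:

```python
def trace_energy_wh(samples):
    """Integrate one (t_seconds, p_watts) power trace into watt-hours
    using the trapezoidal rule."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0

def total_energy_kwh(per_gpu_traces):
    """Sum the per-GPU energies at the global level, in kWh."""
    return sum(trace_energy_wh(tr) for tr in per_gpu_traces) / 1000.0
```

For example, a GPU idling at a constant 60 W for one hour integrates to 60 Wh, and eight such GPUs to 0.48 kWh.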
Based on our designed quantization method in Subsection 3.2, we define a quantization compression rate CR (%), which denotes the percentage of communicated data that is compressed for communication:

(7) $\mathrm{CR} = \dfrac{\mathrm{size}(T_q)}{\mathrm{size}(T)} \times 100\%$

where $T$ represents the communicated tensor and $T_q$ the portion of it selected for quantization.
The results are compared to the benchmarks using the similarity function presented below, with the similarity denoted as fidelity:

(8) $F(x, y) = \dfrac{|\langle x, y\rangle|}{\lVert x\rVert\,\lVert y\rVert}$

where $x$ is the computed result tensor and $y$ the benchmark.
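The similarity check can be sketched as a normalized inner-product overlap between the computed and benchmark amplitude vectors; treat this exact form as our assumption, since it is one common choice for such a fidelity measure:

```python
import numpy as np

def fidelity(x, y):
    """Normalized inner-product similarity between result x and
    benchmark y (complex amplitude vectors); 1.0 means identical
    up to a global scale factor."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    return float(np.abs(np.vdot(x, y)) /
                 (np.linalg.norm(x) * np.linalg.norm(y)))
```

By construction the measure is invariant to rescaling either vector, so it compares the shape of the amplitude distributions rather than their absolute magnitudes.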
4.3. Assessment for low-precision communication
In this section, we explore the most effective method of data quantization for communication with the dual goals of minimizing energy consumption and achieving high-fidelity results. First, we analyze precision sensitivity within the stem path of the 4T tensor network. We then evaluate the impact of quantization for both intra-node and inter-node communication, and identify the quantization strategy for tackling the fidelity-energy trade-off.
4.3.1. Single step of quantization
We conducted stepwise quantization experiments based on single-step quantization (see Fig. 6).

Relative fidelity is less stable and lower when quantization is applied in the early stages of the task, because quantizing early accumulates more error than quantizing late. Furthermore, the relative fidelity is largely independent of the amount of communicated data. We therefore quantize in the later stages of the task and select as large an amount of data for quantization as possible, to maximize the benefit. The dashed line marked on the graph indicates the quantization steps we ultimately chose.
4.3.2. Assessment for quantization of intra-node communication
The time of all-to-all communication is

(9) $t_{\mathrm{comm}} = \dfrac{(n-1)\,S}{n \cdot r \cdot B}$

where $n$ is the number of devices within a node (8 in this case), $S$ is the per-device data volume, $B$ is the unidirectional bandwidth, and $r$ denotes the bandwidth utilization rate of all-to-all communication, approximately 50% in practice. The energy consumption is proportional to

(10) $E \propto \alpha_1 t_{\mathrm{comm}} + \alpha_2 t_{\mathrm{quant}}$

where $\alpha_1$ and $\alpha_2$ are the power coefficients of communication and computation, respectively; empirically, $\alpha_2$ exceeds $\alpha_1$ (cf. Table 2). For every 1 GB of data, the quantization kernel takes 4.25 ms to process, whereas the reduction in communication time resulting from quantization, as per Equation 9, is a mere 4.78 ms. The time saving achieved through quantization is therefore insignificant.
Upon further analysis, taking Equation 10 into consideration, overall energy consumption increases, since $\alpha_2 \cdot 4.25\,\mathrm{ms} > \alpha_1 \cdot 4.78\,\mathrm{ms}$ when $\alpha_2 > \alpha_1$. This, combined with the reduction in fidelity described in Section 4.3.1, demonstrates that quantization within intra-node communication yields negative returns. Consequently, quantization is not applied to intra-node communication in the final task.
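The intra-node trade-off can be checked numerically from the figures above (per-GB kernel time 4.25 ms, communication-time saving 4.78 ms) together with representative power coefficients; using the midpoints of the Table 2 ranges for those coefficients is our assumption:

```python
# Measured costs per 1 GB of intra-node data (from the text)
t_quant_ms = 4.25   # extra quantization-kernel time
t_saved_ms = 4.78   # all-to-all communication time saved (Eq. 9)

# Representative power coefficients: midpoints of the Table 2 ranges
# (our assumption; the paper only gives the ranges)
p_comm_w = (90 + 135) / 2    # ~112.5 W during communication
p_comp_w = (220 + 450) / 2   # ~335 W during computation

# net time saved per GB: barely half a millisecond
net_time_saving_ms = t_saved_ms - t_quant_ms

# energy change per GB in millijoules (W * ms = mJ);
# positive means quantization costs more energy than it saves
delta_energy_mj = p_comp_w * t_quant_ms - p_comm_w * t_saved_ms

print(f"net time saving: {net_time_saving_ms:.2f} ms/GB")
print(f"energy change:   {delta_energy_mj:+.0f} mJ/GB")
```

With these numbers the time saving is marginal while the energy delta is clearly positive, which is exactly why intra-node quantization is rejected.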
4.3.3. Assessment for quantization of inter-node communication
Although quantization is counterproductive for intra-node communication, Equations 9 and 10 suggest that it can benefit inter-node communication. To demonstrate this more precisely and concretely, we conducted an end-to-end subtask of a 4T TN. The result is shown in Fig. 7.
We observe that the total time and energy consumption gradually decrease from float precision down to int4 (128), but remain relatively constant thereafter. The performance improvement from float precision to int4 (128) outweighs the decrease in relative fidelity, while the improvement beyond int4 (256) is smaller than the fidelity loss it incurs. In summary, int4 with a group size of 128 yields the maximum positive impact for inter-node quantization. We adopt this as the final quantization scheme for our task: at the cost of a 6.55% loss in relative fidelity, time decreases by 50.08% and energy consumption by 30.23%.
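The chosen scheme, int4 with a group size of 128, can be sketched as a symmetric per-group quantizer; this is an illustrative CPU sketch (the production kernel runs on GPU and packs two 4-bit codes per byte):

```python
import numpy as np

GROUP = 128  # group size selected in this section

def quantize_int4(x, group=GROUP):
    """Symmetric per-group int4 quantization: each group of `group`
    consecutive values shares one scale; codes lie in [-7, 7]."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, group)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)   # avoid divide-by-zero
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale.astype(np.float32)

def dequantize_int4(codes, scale):
    return (codes.astype(np.float32) * scale).reshape(-1)
```

Each 128-value group carries one float scale, so the payload drops from 32 bits per value to roughly 4 bits plus a small per-group overhead, at the cost of a bounded per-group rounding error.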

Computation dtype | Communication dtype | Hybrid communication? | Inter (GB)/GPU | Intra (GB)/GPU | Other optimizations | Nodes | Energy (Wh) | Fidelity (%)
---|---|---|---|---|---|---|---|---
float | float | No | 36 | – | No | 8 | 19.78 | 100
float | half | No | 18 | – | No | 8 | 16.48 | 99.999
half | half | No | 36 | – | No | 4 | 13.03 | 99.995
half | half | Yes | 28 | 20 | No | 4 | 12.67 | 99.995
half | half | Yes | 24 | 40 | Yes | 2 | 10.57 | 99.965
half | int8 | Yes | 12 | 40 | Yes | 2 | 10.12 | 99.912
half | int4(128) | Yes | 6 | 40 | Yes | 2 | 9.89 | 98.007
4.4. Assessment of the proposed techniques
In this section, we aim to demonstrate how the proposed innovations improve performance incrementally. As shown in Table 3, each row represents a single subtask. We start from the baseline case, where no optimizations or quantizations are applied, using single-precision floating-point data types and setting the result as a benchmark.
First, we halve the communicated data volume by converting single precision to half precision for communication, leading to a 16.68% reduction in energy consumption with negligible loss of fidelity. We then adopt half precision for computation as well, reducing the minimum number of nodes needed for a sub-network from 8 to 4; this yields a further 20.93% decrease in energy usage while maintaining a negligible loss of accuracy.
Next, we offload part of the inter-node communication to intra-node communication, since certain tensor-network contraction steps do not require data to be permuted across all nodes. Redirecting some of the communication burden intra-node yields an additional 2.76% reduction in energy consumption with minimal fidelity degradation. Subsequently, by leveraging the structure of the 4T network, we implement a recomputation algorithm that reduces the minimum number of nodes required for a sub-network from 4 to 2, achieving a further 16.57% energy reduction with no significant loss of fidelity.
Our final experiment employs int8 and int4 quantization for communication, where energy reductions of 4.25% and 6.43%, respectively, are achieved with fidelity losses within 2%. In conclusion, the proposed innovations progressively contribute to improving performance, enabling substantial energy savings and maintaining high fidelity in tensor contraction operations.
4.5. Verification and simulation of Sycamore circuits
Method | 4T, w/o post-processing | 4T, w/ post-processing | 32T, w/o post-processing | 32T, w/ post-processing
---|---|---|---|---
Time complexity (FLOP) | – | – | – | –
Memory complexity (elements) | – | – | – | –
XEB value (%) | 0.2036 | 0.2059 | 0.21194 | 0.2158
Efficiency (%) | 21.09 | 18.14 | 16.65 | 17.09
Total number of subtasks | – | – | – | –
Number of subtasks conducted | 528 | 84 | 9 | 1
Nodes per subtask | 2 | 2 | 32 | 32
Memory/Multi-node level (TB) | 1.25 | 1.25 | 20 | 20
Computing resource (A100) | 2112 | 96 | 2304 | 256
Time-to-solution (s) | **32.51** | **133.15** | **14.22** | **17.18**
Energy consumption (kWh) | 5.77 | **1.12** | **2.39** | **0.29**
In order to sample from the Sycamore circuit, we applied our techniques to two large-scale tensor networks, namely 4T and 32T, with and without post-selection (also known as post-processing) (leapfrogging, ), aiming for uncorrelated samples. Our experiment attained a peak half-precision performance of 561 PFLOPS, with approximately 20% efficiency achievable through the proposed techniques. A summary of our verification details is presented in Table 4, where bold numbers indicate metrics that outperform Sycamore. All test cases surpass Sycamore in time-to-solution, and three cases surpass it in energy consumption. Among these, our best case (32T with post-processing) demonstrates an order-of-magnitude reduction in both time and energy compared to Sycamore.
We compared the four cases of our simulation for the task of generating bitstring samples with a bounded fidelity of 0.002. Our analysis focused on three key aspects: the influence of post-selection, the size of the tensor network, and the scalability of the approach.
4.5.1. Assessment of Post-processing
As previously mentioned, the extensive tensor network can be divided into self-contained sub-networks, each of which can be contracted at a single multi-node level. In our simulation, the post-selection approach requires only a fraction of the subtasks that would otherwise be needed to achieve an XEB value of 0.002 (84 versus 528 subtasks for the 4T network; see Table 4). Comparing the methods with and without post-selection, we conclude that employing post-selection significantly reduces both time and space complexity.
4.5.2. Assessment of size of tensor network
The space complexity of the algorithm is proportional to $s \cdot 2^{M}$, where $s$ represents the size of the data type and $M$ corresponds to the targeted contraction treewidth, which can be manually specified in the slicing algorithm. Consequently, the space complexity is fully controllable. Comparing methods across tensor-network sizes, it is evident that time and space complexity decrease as the size of the tensor network increases (at the global level, not for a single multi-node level).
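As a quick illustration of the $s \cdot 2^{M}$ scaling (the parameter values here are hypothetical examples, not the paper's actual slicing settings):

```python
def subnetwork_memory_bytes(dtype_bytes, treewidth):
    """Space complexity s * 2**M: dtype size in bytes times
    2^treewidth tensor elements."""
    return dtype_bytes * 2 ** treewidth

# With a 4-byte element (e.g. complex-half), M = 40 gives 4 TiB;
# each unit increase of M doubles the footprint, so M is the knob
# that fits a sub-network into a given multi-node memory budget.
footprint_tib = subnetwork_memory_bytes(4, 40) / 2 ** 40
```

This is why manually capping the treewidth in the slicing algorithm gives complete control over per-sub-network memory.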

4.5.3. Scalability Exploration
In Fig. 8, we observe that the total time-to-solution decays inversely with the number of GPUs used for computation, owing to the parallel-friendly slicing algorithm and the three-level parallel scheme. Specifically, our method achieves strong scaling from 128 GPUs (16 nodes) to 768 GPUs (96 nodes) for the 4T tensor network with post-selection, from 271 GPUs (34 nodes) to 2112 GPUs (264 nodes) for the 4T tensor network without post-selection, and from 256 GPUs (32 nodes) to 2304 GPUs (288 nodes) for the 32T tensor network without post-selection. Moreover, energy consumption remains roughly constant as the number of nodes increases.
5. Conclusion
Previous studies have predominantly focused on identifying optimal paths for classical devices to efficiently compute tensor networks and store tensors within time and memory constraints, from an algorithmic perspective. In contrast, our work aims to enhance the capabilities of multi-node computational clusters and reduce computational complexity. This is achieved through the implementation of a three-level parallel scheme coupled with a carefully crafted inter-node and intra-node hybrid communication approach.
To significantly mitigate overhead costs associated with large tensors, we propose a low-precision communication approach and an einsum extension for complex-half data. Additionally, we introduce several specialized techniques to expedite specific tasks.
In our best case without post-processing, we reduced the time-to-solution to 14.22 seconds with an energy consumption of 2.39 kWh at a fidelity of 0.002. With the post-processing technique we achieve our best overall performance, remarkably reducing the time-to-solution for sampling bitstrings to just 17.18 seconds with an energy consumption of only 0.29 kWh. This represents a one-order-of-magnitude reduction in both time and energy compared to Google's quantum processor Sycamore.
In this work, by leveraging state-of-the-art algorithms and a highly optimized scalable tensor network computation system based on GPU clusters, we have significantly surpassed Sycamore in both computing speed and energy consumption. We experimentally challenge Google’s initial claim of quantum supremacy. Our work resets the boundaries of classical computing capabilities and we hope to help the community evaluate quantum advantage more robustly in future quantum computational advantage experiments.
In future work, our techniques for large-scale tensor networks can be extended beyond RQC sampling simulation. They apply directly to diverse fields such as quantum circuit simulation (guerreschi2020intel, ), condensed matter physics (liu2021tropical, ) and combinatorial optimization (liu2023computing, ). This extension can improve the scalability of problem models and facilitate the resolution of realistic challenges such as satisfiability, set packing, and clique problems.
Acknowledgements.
This work was supported by the National Key R&D Program of China (No. 2022ZD0160201, No. 2022ZD0160101).

References
- [1] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
- [2] Han-Sen Zhong, Hui Wang, Yu-Hao Deng, Ming-Cheng Chen, Li-Chao Peng, Yi-Han Luo, Jian Qin, Dian Wu, Xing Ding, Yi Hu, et al. Quantum computational advantage using photons. Science, 370(6523):1460–1463, 2020.
- [3] Han-Sen Zhong, Yu-Hao Deng, Jian Qin, Hui Wang, Ming-Cheng Chen, Li-Chao Peng, Yi-Han Luo, Dian Wu, Si-Qiu Gong, Hao Su, et al. Phase-programmable gaussian boson sampling using stimulated squeezed light. Physical review letters, 127(18):180502, 2021.
- [4] Yu-Hao Deng, Yi-Chao Gu, Hua-Liang Liu, Si-Qiu Gong, Hao Su, Zhi-Jiong Zhang, Hao-Yang Tang, Meng-Hao Jia, Jia-Min Xu, Ming-Cheng Chen, et al. Gaussian boson sampling with pseudo-photon-number-resolving detectors and quantum computational advantage. Physical review letters, 131(15):150601, 2023.
- [5] Yulin Wu, Wan-Su Bao, Sirui Cao, Fusheng Chen, Ming-Cheng Chen, Xiawei Chen, Tung-Hsun Chung, Hui Deng, Yajie Du, Daojin Fan, et al. Strong quantum computational advantage using a superconducting quantum processor. Physical review letters, 127(18):180501, 2021.
- [6] Qingling Zhu, Sirui Cao, Fusheng Chen, Ming-Cheng Chen, Xiawei Chen, Tung-Hsun Chung, Hui Deng, Yajie Du, Daojin Fan, Ming Gong, et al. Quantum computational advantage via 60-qubit 24-cycle random circuit sampling. Science bulletin, 67(3):240–245, 2022.
- [7] Koen Bertels, Aritra Sarkar, and Imran Ashraf. Quantum computing—from nisq to pisq. IEEE Micro, 41(5):24–32, 2021.
- [8] Alexander Zlokapa, Benjamin Villalonga, Sergio Boixo, and Daniel A Lidar. Boundaries of quantum supremacy via random circuit sampling. npj Quantum Information, 9(1):36, 2023.
- [9] Benjamin Villalonga, Dmitry Lyakh, Sergio Boixo, Hartmut Neven, Travis S Humble, Rupak Biswas, Eleanor G Rieffel, Alan Ho, and Salvatore Mandrà. Establishing the quantum supremacy frontier with a 281 pflop/s simulation. Quantum Science and Technology, 5(3):034003, 2020.
- [10] Thomas Häner and Damian S Steiger. 0.5 petabyte simulation of a 45-qubit quantum circuit. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–10, 2017.
- [11] Xin-Chuan Wu, Sheng Di, Emma Maitreyee Dasgupta, Franck Cappello, Hal Finkel, Yuri Alexeev, and Frederic T Chong. Full-state quantum circuit simulation by using data compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–24, 2019.
- [12] Cupjin Huang, Fang Zhang, Michael Newman, Junjie Cai, Xun Gao, Zhengxiong Tian, Junyin Wu, Haihong Xu, Huanjun Yu, Bo Yuan, et al. Classical simulation of quantum supremacy circuits. arXiv preprint arXiv:2005.06787, 2020.
- [13] Yong Liu, Xin Liu, Fang Li, Haohuan Fu, Yuling Yang, Jiawei Song, Pengpeng Zhao, Zhen Wang, Dajia Peng, Huarong Chen, et al. Closing the ”quantum supremacy” gap: achieving real-time simulation of a random quantum circuit using a new sunway supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2021.
- [14] Yong Liu, Yaojian Chen, Chu Guo, Jiawei Song, Xinmin Shi, Lin Gan, Wenzhao Wu, Wei Wu, Haohuan Fu, Xin Liu, et al. Validating quantum-supremacy experiments with exact and fast tensor network contraction. arXiv preprint arXiv:2212.04749, 2022.
- [15] Chu Guo, Yong Liu, Min Xiong, Shichuan Xue, Xiang Fu, Anqi Huang, Xiaogang Qiang, Ping Xu, Junhua Liu, Shenggen Zheng, et al. General-purpose quantum circuit simulator with projected entangled-pair states and the quantum supremacy frontier. Physical review letters, 123(19):190501, 2019.
- [16] Riling Li, Bujiao Wu, Mingsheng Ying, Xiaoming Sun, and Guangwen Yang. Quantum supremacy circuit simulation on sunway taihulight. IEEE Transactions on Parallel and Distributed Systems, 31(4):805–816, 2019.
- [17] Feng Pan, Keyang Chen, and Pan Zhang. Solving the sampling problem of the sycamore quantum circuits. Phys. Rev. Lett., 129:090502, Aug 2022.
- [18] Feng Pan and Pan Zhang. Simulation of quantum circuits using the big-batch tensor network method. Phys. Rev. Lett., 128:030501, Jan 2022.
- [19] Xian-He Zhao, Han-Sen Zhong, Feng Pan, Zi-Han Chen, Rong Fu, Zhongling Su, Xiaotong Xie, Chaoxing Zhao, Pan Zhang, Wanli Ouyang, et al. Leapfrogging sycamore: Harnessing 1432 gpus for 7× faster quantum random circuit sampling. arXiv preprint arXiv:2406.18889, 2024.
- [20] Xun Gao, Marcin Kalinowski, Chi-Ning Chou, Mikhail D. Lukin, Boaz Barak, and Soonwon Choi. Limitations of linear cross-entropy as a measure for quantum advantage. PRX Quantum, 5:010334, Feb 2024.
- [21] Yaojian Chen, Yong Liu, Xinmin Shi, Jiawei Song, Xin Liu, Lin Gan, Chu Guo, Haohuan Fu, Jie Gao, Dexun Chen, and Guangwen Yang. Lifetime-based optimization for simulating quantum circuits on a new sunway supercomputer. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP ’23, page 148–159, New York, NY, USA, 2023. Association for Computing Machinery.
- [22] Sebastian Brandhofer, Ilia Polian, and Kevin Krsulich. Optimal partitioning of quantum circuits using gate cuts and wire cuts. IEEE Transactions on Quantum Engineering, 2023.
- [23] Feng Pan, Hanfeng Gu, Lvlin Kuang, Bing Liu, and Pan Zhang. Efficient quantum circuit simulation by tensor network methods on modern gpus. arXiv preprint arXiv:2310.03978, 2023.
- [24] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch, Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel Michielsen. Massively parallel quantum computer simulator, eleven years later. Computer Physics Communications, 237:47–61, 2019.
- [25] Guifré Vidal. Efficient classical simulation of slightly entangled quantum computations. Physical review letters, 91(14):147902, 2003.
- [26] Koen De Raedt, Kristel Michielsen, Hans De Raedt, Binh Trieu, Guido Arnold, Marcus Richter, Th Lippert, Hiroshi Watanabe, and Nobuyasu Ito. Massively parallel quantum computer simulator. Computer Physics Communications, 176(2):121–136, 2007.
- [27] Benjamin Villalonga, Sergio Boixo, Bron Nelson, Christopher Henze, Eleanor Rieffel, Rupak Biswas, and Salvatore Mandrà. A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware. npj Quantum Information, 5(1):86, 2019.
- [28] Jianxin Chen, Fang Zhang, Cupjin Huang, Michael Newman, and Yaoyun Shi. Classical simulation of intermediate-size quantum circuits. arXiv preprint arXiv:1805.01450, 2018.
- [29] Johnnie Gray and Stefanos Kourtis. Hyper-optimized tensor network contraction. Quantum, 5:410, 2021.
- [30] Xin Liu, Chu Guo, Yong Liu, Yuling Yang, Jiawei Song, Jie Gao, Zhen Wang, Wenzhao Wu, Dajia Peng, Pengpeng Zhao, et al. Redefining the quantum supremacy baseline with a new generation sunway supercomputer. arXiv preprint arXiv:2111.01066, 2021.
- [31] Ze-bin Wu and Jun-qing Yu. Vector quantization: a review. Frontiers of Information Technology & Electronic Engineering, 20(4):507–524, 2019.
- [32] Haibao Yu, Tuopu Wen, Guangliang Cheng, Jiankai Sun, Qi Han, and Jianping Shi. Gdrq: Group-based distribution reshaping for quantization. arXiv preprint arXiv:1908.01477, 2019.
- [33] Nvidia management library (nvml). https://developer.nvidia.com/nvidia-management-library-nvml. Accessed: 2024-03-27.
- [34] Gian Giacomo Guerreschi, Justin Hogaboam, Fabio Baruffa, and Nicolas PD Sawaya. Intel quantum simulator: A cloud-ready high-performance simulator of quantum circuits. Quantum Science and Technology, 5(3):034007, 2020.
- [35] Jin-Guo Liu, Lei Wang, and Pan Zhang. Tropical tensor network for ground states of spin glasses. Physical Review Letters, 126(9):090506, 2021.
- [36] Jin-Guo Liu, Xun Gao, Madelyn Cain, Mikhail D Lukin, and Sheng-Tao Wang. Computing solution space properties of combinatorial optimization problems via generic tensor networks. SIAM Journal on Scientific Computing, 45(3):A1239–A1270, 2023.