
Achieving Energetic Superiority Through System-Level Quantum Circuit Simulation

Rong Fu1, Zhongling Su1, Han-Sen Zhong1,†, Xiti Zhao1, Jianyang Zhang1, Feng Pan2, Pan Zhang3, Xianhe Zhao2,4,5, Ming-Cheng Chen2,4,5, Chao-Yang Lu2,4,5,‡, Jian-Wei Pan2,4,5, Zhiling Pei1,§, Xingcheng Zhang1,¶, Wanli Ouyang1
1Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China. 2Shanghai Research Center for Quantum Science and CAS Center for Excellence in Quantum Information and Quantum Physics, University of Science and Technology of China, Shanghai, 201315, China. 3CAS Key Laboratory for Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing, 100190, China. 4Hefei National Research Center for Physical Sciences at the Microscale and School of Physical Sciences, University of Science and Technology of China, Hefei, 230026, China. 5Hefei National Laboratory, University of Science and Technology of China, Hefei, 230088, China.
†zhonghansen@pjlab.org.cn  ‡cylu@ustc.edu.cn  §peizhiling@pjlab.org.cn  ¶zhangxingcheng@pjlab.org.cn
Abstract.

Quantum computational superiority promises both rapid computation and high energy efficiency. Despite recent advances in classical algorithms aimed at refuting the milestone claim of Google's Sycamore, challenges remain in generating uncorrelated samples of random quantum circuits.

In this paper, we present a large-scale system technology that leverages optimization at the global, node, and device levels to achieve unprecedented scalability for tensor networks. Our techniques accommodate tensor networks requiring tens of terabytes of memory, well beyond the capacity of a single node, and scale to 2304 GPUs with a peak half-precision computing power of 561 PFLOPS. Notably, we achieved a time-to-solution of 14.22 seconds with an energy consumption of 2.39 kWh at a fidelity of 0.002, and our most remarkable result is a time-to-solution of 17.18 seconds with an energy consumption of only 0.29 kWh at an XEB of 0.002 after post-processing. Both outperform Google's quantum processor Sycamore, which recorded 600 seconds and 4.3 kWh, in speed and energy efficiency.

Quantum random circuit sampling, Tensor networks, Parallel computing, Energy consumption, Quantization, Low-precision communication

1. OVERVIEW: A New Era of Speed and Efficiency in Circuit Modeling

In the field of quantum computing, Google's claim of "quantum supremacy" with its Sycamore quantum processor marks a crucial advancement (Quantum_supremacy). Sycamore generated 3 (1) million uncorrelated samples within 600 (200) seconds, a task claimed to require approximately 10,000 years on the world's most powerful supercomputer. This achievement not only propels advancements in quantum computing (jiuzhang1; jiuzhang2; jiuzhang3; zuchongzhi56qubit; zuchongzhi60qubit), but also serves as a significant catalyst for progress in classical quantum circuit simulation.

Refer to caption
Figure 1. Performance of implementations of sampling the Sycamore circuit. The horizontal axis denotes the time-to-solution, and the vertical axis denotes the energy consumption of the quantum experiment or classical simulations. Circles and squares correspond to classical simulations and quantum experiments, respectively. A hollow circle indicates a correlated sampling loophole in the corresponding classical simulation. The region shaded in misty rose marks superior results in terms of both time and energy consumption.

The overarching challenge in simulating large-scale quantum circuits is exponential time complexity: the computational cost of quantum circuit simulation increases exponentially with the size of the circuit (bertels2021quantum; zlokapa2023boundaries; villalonga2020establishing; haner20175; wu2019full). To tackle this challenge, a myriad of methods (Alibaba_19days; Sunway_304s; liu2022validating; guo2019general; li2019quantum), delineated in Fig. 1, have focused on breaking down the large-scale simulation into smaller sub-networks that can be run on a classical supercomputer, effectively reducing memory usage from petabytes (PB) to terabytes (TB) or even gigabytes (GB). However, the samples generated in (Sunway_304s) exhibit significant correlations among them and are thus not considered a faithful simulation of the Sycamore quantum processor. This issue has been addressed in subsequent tensor network research (512GPUs_15h; 60GPUs_5days). To reduce the computational complexity, (leapfrogging; post-process) developed a post-processing (aka post-selection) algorithm that selects the $k$ bitstrings with the top-$k$ probabilities from $N$ bitstrings, thereby enhancing the cross entropy benchmark (XEB) value by a factor of $\ln(k/N)$. The computational complexity incurred by calculating the probabilities of all samples within a correlated subspace (bitstrings that share some bits) is remarkably low. Post-processing entails computing 3 million independent correlated subspaces, each containing thousands of samples, and subsequently selecting the sample with the highest probability from each correlated subspace. This method not only generates uncorrelated samples but also boosts the XEB value by an order of magnitude.
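As a concrete illustration, the selection step can be sketched in a few lines of PyTorch (a toy sketch with made-up sizes; `probs` stands in for the probabilities produced by the tensor-network contraction):

```python
import torch

# Toy sketch of post-selection: probs[i, j] holds the computed probability of
# the j-th candidate bitstring in the i-th correlated subspace (the paper uses
# 3 million subspaces with thousands of candidates; sizes here are illustrative).
n_subspaces, subspace_size, n_qubits = 8, 16, 53
probs = torch.rand(n_subspaces, subspace_size)
bitstrings = torch.randint(0, 2, (n_subspaces, subspace_size, n_qubits))

# Keep the highest-probability bitstring from each correlated subspace,
# yielding one uncorrelated sample per subspace and boosting the XEB value.
best = probs.argmax(dim=1)
samples = bitstrings[torch.arange(n_subspaces), best]   # (n_subspaces, n_qubits)
```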

Although tensor networks augmented with slicing techniques have demonstrated their advantages in large-scale simulations, decomposing quantum circuits into sub-networks typically results in explosive growth in computational cost, owing to the heavy overhead of redundant calculations (Lifetime-Based; brandhofer2023optimal; leapfrogging).

As shown in Fig. 2, the computational complexity of the optimal contraction path is approximately inversely proportional to the maximum memory size. This motivates us to explore minimizing the total energy cost of the subtasks through a large-scale distributed-memory system. However, this approach poses a predicament: when memory usage exceeds the capacity of a single node, inter-node communication becomes very costly. Inter-node data exchange not only incurs latency but also severely hinders overall system performance and energy efficiency.

Refer to caption
Figure 2. The relationship between spatial complexity and temporal complexity. Given a memory limit, (a) shows the minimal time complexity of contraction paths, where the red and green hollow pentagrams represent the chosen optimal solutions at 4 TB and 32 TB, respectively. (b) shows the time complexity distributions of contraction paths searched by simulated annealing under memory limits ranging from 64 GB to 2 PB. For each memory constraint, we take the minimum time complexity in (b) as the optimal contraction path, corresponding to the point of the same color in (a).

Facing the aforementioned challenges, we intend to surpass the capabilities of present-day quantum processors, such as Sycamore, in terms of both speed and power consumption. Our major goals are twofold:

  • Firstly, achieving an order-of-magnitude reduction in computation time marks a major breakthrough in the landscape of quantum computing. It transcends mere acceleration of quantum circuit simulation; rather, it represents a fundamental shift in which classical computers can not only keep up with but even outperform quantum computers on a computational task such as random circuit sampling. Leading-edge supercomputers not only provide a fast and scalable simulator for increasingly intricate quantum circuits, but may also catalyze developments in wide-ranging fields, including cryptography and pharmacological research. This breakthrough could play a pivotal role in unveiling the mysteries of the once esoteric quantum realm and pushing the frontiers of computational capability.

  • Secondly, energy consumption is a central concern in quantum computing. Some estimates suggest that quantum technology could reduce energy usage by a factor of 100 to 1000 by processing complex computations far more efficiently. Nonetheless, the overall energy efficiency of quantum computing remains an open question, and low-energy solutions require further exploration. The quest for energy superiority in circuit simulation, once the sole province of quantum processors, demonstrates the expanding capability of supercomputers to solve intricate problems without relying on increased energy. Furthermore, reducing energy consumption is in concert with global commitments to sustainability and environmental stewardship, fostering the creation of green computing frameworks. In conclusion, an energy-efficient quantum circuit simulator is crucial for the ongoing cost-effectiveness analysis of quantum computing and helps drive sustainable development in real-world challenges.

Given the above considerations, we propose the following system-level techniques encompassing algorithmic efficiency, parallel architecture, and precision control: (1) a three-level scheme tailored for large-scale contraction networks to fully leverage distributed memory and boost energy efficiency; (2) a hybrid communication strategy to maximize intra-node bandwidth utilization and alleviate inter-node data transfer, together with low-precision quantization that reduces communication volume fourfold; (3) an efficient einsum extension for complex-half data, halving memory requirements and leveraging high-speed fp16 tensor core computation. Furthermore, we adopt additional optimizations, such as an optimized padding scheme that avoids heavy copy operations on large tensors and a recomputation technique that halves the number of nodes required for a 4T sub-task, leading to more efficient computing.

Our three-level parallelism strategy enables scalability in both the size of the tensor network and the number of GPUs executing parallel, independent sub-tasks. In our experiments, we tested two sizes of tensor networks, 4 TB and 32 TB (quantified in the complex-float format), with and without the post-processing technique proposed in (leapfrogging; post-process), demonstrating close-to-linear scaling on up to 2,304 A100 GPUs.

We achieved a time-to-solution of 14.22 seconds with an energy consumption of 2.39 kWh at a fidelity of 0.002. Our top-tier solution contracts a 32T tensor network (the green hollow pentagram in Fig. 2) with the post-processing technique, which cuts the time-to-solution for sampling $3\times10^{6}$ bitstrings to just 17.18 seconds, a significant reduction compared to Sycamore's 600 seconds. Additionally, it consumes a mere 0.29 kWh of energy, markedly lower than Sycamore's expenditure of 4.3 kWh (Quantum_supremacy). To the best of our knowledge, this is the first demonstration of a clear advantage over Sycamore in terms of energy consumption.

2. Current state of the art

2.1. Quantum circuits

Quantum circuits are constructed from qubits interconnected through a sequence of quantum gates. While classical circuits use high and low voltage levels to represent binary states, quantum circuits leverage quantum superposition. The interactions and manipulations of quantum states are dictated by unitary operations known as quantum gates. To extract information embedded within quantum states, measurements are performed. The outcome of a measurement is a collapsed 0-1 bitstring in the chosen measurement basis (pan2023efficient). It is important to note that measurement destroys the superposition of the quantum state, necessitating repetition of the entire experiment to gather further information about the state in question. Here, we provide a visual illustration of Google's Sycamore (Quantum_supremacy) quantum circuit to showcase the circuit architecture and its quantum gates.

Refer to caption
Figure 3. Example quantum circuit instance.

Fig. 3 shows a portion of a 5-qubit example circuit, in which each random quantum circuit (RQC) consists of a series of $m$ full cycles followed by a half cycle ending with the measurement of all qubits. Each full cycle involves two steps: first, a single-qubit gate is applied to each qubit; next, two-qubit gates are applied to pairs of qubits, with different qubit pairs allowed to interact in different cycles. The sequence is repeated in subsequent cycles. The half cycle preceding the measurement comprises only single-qubit gates.

In Google's Sycamore processor, three distinct single-qubit gates are configured, each representing a $\pi/2$-rotation about an axis on the equator of the Bloch sphere. Disregarding the global phase, these gates can be written as

$$\sqrt{X}=\frac{1}{\sqrt{2}}\begin{bmatrix}1&-i\\-i&1\end{bmatrix},\quad \sqrt{Y}=\frac{1}{\sqrt{2}}\begin{bmatrix}1&-1\\1&1\end{bmatrix},\quad \sqrt{W}=\frac{1}{\sqrt{2}}\begin{bmatrix}1&-\sqrt{i}\\\sqrt{-i}&1\end{bmatrix},$$

and the two-qubit gates are determined by the cycle index:

$$\mathbf{fSim}(\theta,\phi)=\begin{bmatrix}1&0&0&0\\0&\cos(\theta)&-i\sin(\theta)&0\\0&-i\sin(\theta)&\cos(\theta)&0\\0&0&0&e^{-i\phi}\end{bmatrix},$$

where the parameters $\theta$ and $\phi$ of the $\mathbf{fSim}$ matrix are determined by the qubit pairing.
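For reference, the gate set above can be written out numerically; the following sketch (PyTorch plus the standard-library `cmath`, with a hypothetical `fsim` helper we introduce for illustration) builds the same matrices:

```python
import cmath
import torch

inv_sqrt2 = 2 ** -0.5
sqrt_i = cmath.exp(1j * cmath.pi / 4)        # sqrt(i)
sqrt_minus_i = cmath.exp(-1j * cmath.pi / 4) # sqrt(-i)

# pi/2-rotations about equatorial axes of the Bloch sphere (global phase dropped).
sqrt_X = inv_sqrt2 * torch.tensor([[1, -1j], [-1j, 1]])
sqrt_Y = inv_sqrt2 * torch.tensor([[1, -1], [1, 1]], dtype=torch.complex64)
sqrt_W = inv_sqrt2 * torch.tensor([[1, -sqrt_i], [sqrt_minus_i, 1]])

def fsim(theta: float, phi: float) -> torch.Tensor:
    """fSim two-qubit gate; theta and phi depend on the qubit pairing."""
    c, s = cmath.cos(theta), cmath.sin(theta)
    return torch.tensor([
        [1, 0, 0, 0],
        [0, c, -1j * s, 0],
        [0, -1j * s, c, 0],
        [0, 0, 0, cmath.exp(-1j * phi)],
    ])
```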

2.2. RQC simulation methods

RQC simulations on classical computers generally fall into two main categories (Sunway_304s). A traditional approach to simulating the evolution of the quantum state is the state vector method (de2019massively), which tracks the evolution of quantum states using a state vector whose time evolution is described by the Schrödinger equation. Nevertheless, this necessitates exponentially large memory for an $n$-qubit system (vidal2003efficient), thereby precluding RQC simulations with large numbers of qubits and high entanglement (state_vector).

An alternative method is to use tensor networks, where an $n$-qubit quantum state is formulated as a rank-$n$ tensor with $2^{n}$ complex entries. Single-qubit and two-qubit quantum gate operations can be represented by rank-2 and rank-4 tensors, respectively. Consequently, an RQC can be converted into a tensor network, which can then be contracted to yield the amplitudes of specific final-state bitstrings (villalonga2019flexible; chen2018classical).
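A minimal einsum sketch of this correspondence (using a Hadamard and a CZ gate purely for illustration, not the Sycamore gate set):

```python
import torch

# Rank-1 qubit states, a rank-2 single-qubit gate, and a rank-4 two-qubit gate.
q0 = torch.tensor([1, 0], dtype=torch.complex64)              # |0>
q1 = torch.tensor([1, 0], dtype=torch.complex64)
H  = torch.tensor([[1, 1], [1, -1]], dtype=torch.complex64) / 2 ** 0.5
CZ = torch.diag(torch.tensor([1, 1, 1, -1], dtype=torch.complex64)).reshape(2, 2, 2, 2)

# Contracting the network yields final-state amplitudes directly.
psi = torch.einsum('xyij,ia,jb,a,b->xy', CZ, H, H, q0, q1)
amp_11 = psi[1, 1]        # amplitude of bitstring '11' for this 2-qubit circuit
```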

Over the past four years, the simulation of Sycamore's RQC has gradually converged on tensor network algorithms. Significant effort has been devoted to two key points: finding the optimal contraction order, and streamlining the contraction process to achieve high-performance calculation.

A powerful parallel algorithm for the contraction of tensor networks is stem optimization (Alibaba_19days), which involves a path of contractions that dominates the overall computational cost. The tensor-based algorithm dynamically decomposes the computational task into numerous smaller subtasks that can be executed in parallel on a cluster. Although (Alibaba_19days) demonstrated improvements over conventional methods, these subtasks still involved intense computation within a single node. (Sunway_304s) introduced an optimized slicing strategy that reduced the space complexity of sliced tensors while preserving the embarrassing parallelism among subtasks. This strategy, combined with better contraction paths and custom optimizations, allowed for more efficient computing than previous methods, such as those provided by the Cotengra software (cotengra).

Pan et al. (512GPUs_15h) considered quantum circuits as a 3-dimensional tensor network, from which they broke edges to form sub-networks. This process can be visualized as drilling holes in the 3D graphical representation of the network. They introduced a method for sparse-state tensor contraction, which allows the amplitudes of many uncorrelated bitstrings to be calculated in a single step. This approach is notably more efficient at producing numerous uncorrelated samples than previous techniques.

Furthermore, (leapfrogging) presented a post-processing/post-selection technique in which the probability of each bitstring is calculated accurately and the top-$k$ strings with the largest output probabilities are retained. With this approach, merely 0.03% of the $2^{24}$ subtasks (i.e., contracting 0.03% of the sub-networks) needed to be executed to reach an XEB value of 0.002. As a result, they significantly improved XEB values and achieved a substantially faster solution, which has motivated the methodology of this work.

2.3. Classical versus quantum: tensor-network-based experiments

Numerous simulations of random quantum circuits have been specifically designed to bridge the quantum-classical simulation divide. Based on tensor network algorithms, Zuchongzhi 2.0 (zuchongzhi56qubit) and Zuchongzhi 2.1 (zuchongzhi60qubit) reached sampling scales of up to 56 qubits with 20 cycles and 60 qubits with 24 cycles, respectively. Specifically, Zuchongzhi 2.1 required an estimated $1.63\times10^{18}$ floating point operations to generate one perfect sample; it completed the random circuit sampling task in 4.2 hours and achieved an XEB fidelity of $(3.66\pm0.345)\times10^{-4}$.

Alibaba presented a tensor-network-based classical simulation algorithm (Alibaba_19days). Through runtime testing of subtasks, it was estimated that the Sycamore task (a 53-qubit, 20-cycle circuit with a fidelity of 0.2%) could be accomplished in 19.3 days on the Summit cluster. Xin Liu et al. (liu2021redefining) employed Cotengra to determine a near-optimal tensor contraction order, producing 2,000 perfect samples (equivalent to 1 million samples with 0.2% fidelity) within 6.4 days on the latest generation of Sunway supercomputer. Yong Liu et al. developed an RQC simulator on the new-generation Sunway supercomputer, introducing an optimized path strategy based on tensor network simulation that balances memory requirements against the number of concurrent computations, and achieved the remarkable result of sampling 1 million correlated samples of the Google Sycamore RQC task in just 304 seconds (Sunway_304s).

Targeting Google's Sycamore RQC problem, Pan et al. proposed the big-head approach (60GPUs_5days): using 60 GPUs for 5 days, they generated $1\times10^{6}$ correlated samples with an XEB of 0.739, passing the XEB test. Another tensor network method (512GPUs_15h) produced $1\times10^{6}$ uncorrelated bitstrings with fidelity $F\approx0.0037$ in 15 hours on a computational cluster with 512 GPUs. Pan et al. also proposed an optimization strategy that shifts the most time-consuming component of the sparse-state tensor network simulation from sparse Einstein summation (einsum) to matrix multiplication (Alibaba_19days). This optimization was applied to the simulation of 53-qubit, 20-cycle Sycamore circuits, using 2819 A100 GPU hours to verify three million sampled bitstrings.

A recent work (leapfrogging) developed a post-processing algorithm aimed at enhancing XEB values. Implemented on 1,432 NVIDIA A100 GPUs, it yielded 3 million uncorrelated samples with an XEB value of $2\times10^{-3}$ in 86.4 seconds while consuming 13.7 kWh of electricity. To further trade between spatial complexity and temporal complexity, we extend (leapfrogging) by examining the contraction path's time complexity under predetermined memory limits. While Fig. 2(b) presents the frequency distribution of contraction-path time complexities under memory sizes from 64 GB to 2 PB, each point in Fig. 2(a) illustrates the optimal contraction path with the least time complexity under a given memory constraint. As the available memory increases by factors of 8, the time complexity of the optimal contraction path initially decreases rapidly and then gradually flattens, eventually converging beyond the 32 TB memory limit. This observation highlights the potential of harnessing more memory resources for faster computing.

The RQC simulations described above have made significant progress in estimated or actual computation time, and the gap with Sycamore is no longer substantial. Nonetheless, even with the state-of-the-art approach, an order-of-magnitude energy-efficiency gap remains between classical computers and quantum computers on RQC sampling tasks. As (leapfrogging) has solidified the outstanding efficiency of tensor network calculations, the following sections generate uncorrelated samples based on the optimal solutions for the 4 TB and 32 TB tensor networks (marked as red and green hollow pentagrams in Fig. 2(a)).

3. Innovations

The optimal contraction path searching and edge-splitting algorithms for the tensor network are based on (512GPUs_15h). In these methodologies, we drill holes, or break edges, in the original 3D tensor network to decrease the complexity of a sub-task. As explained in Subsection 2.3, enlarging the spatial capacity of tensor networks, e.g., to 4 TB or 32 TB, can greatly reduce the computational complexity. However, scaling up the tensor network introduces communication overhead, posing a new bottleneck in inter-node communication within the GPU cluster.

To address this challenge, this section proposes solutions to mitigate inter-node communication and harness the potential of the GPU cluster through a three-level parallel scheme embedded with a hybrid communication algorithm. Furthermore, we optimize computing resources by converting complex single-precision calculations into complex half-precision calculations. Additionally, we present tailored remedies for specific scenarios encountered in our application domain.

3.1. Three-level parallelization scheme

Refer to caption
Figure 4. Architectural overview and example of the parallel scheme. (a) Overview of the three-level parallel scheme: the task commences at the global level, where the tensor network is partitioned into parallel, independent sub-networks. Data is subsequently segmented across nodes at the multi-node level, interconnected through InfiniBand. Finally, the data is divided into sizes compatible with individual devices within each node, connected via high-bandwidth NVLink. (b) Example of 2-node-4-device communication: we exhibit a subtask spanning two nodes, each hosting two devices, and demonstrate the data permutation occurring at both the inter-node and intra-node levels.

We categorize our architecture into three hierarchical levels as depicted in Fig. 4 (a): global level, multi-node level, and device level.

Consistent with the approaches described in (512GPUs_15h), after identifying a near-optimal path, we initially dissect the contraction of the overall tensor network into independent sub-networks $\bm{TN}$ (e.g., the 4T $\bm{TN}$ and the 32T $\bm{TN}$), which can be executed on separate groups of nodes. Each sub-task, corresponding to the contraction of an independently sliced tensor, is allocated to the multi-node level (indicated by the solid orange rectangle in Fig. 4). Each multi-node-level task encompasses the entire sub-network contraction process and is distributed across multiple nodes interconnected via an InfiniBand (IB) network.

In the context of the tensor network, the term "stem path" refers to a sequence of expensive nodes that dominates the overall computation and memory cost (Alibaba_19days), and the "stem tensor" is the tensor associated with the stem path.

Given a rank-$n$ stem tensor $T_s^{multi\text{-}node}(\alpha_0,\alpha_1,\ldots,\alpha_n)$, each of whose modes has a dimension of 2, the first $N_{inter}$ ranks (also referred to as modes) signify division of the tensor into $2^{N_{inter}}$ segments. After being sliced onto a single node, the stem tensor becomes $T_s^{node}(\alpha_{N_{inter}},\ldots,\alpha_{N_{inter}+N_{intra}},\ldots,\alpha_n)$.

The final level of parallelization, among disparate devices (exemplified by the dashed green rectangle in Fig. 4), involves the matrix multiplication and index permutation operations executed on each GPU, interconnected through high-bandwidth NVLink. Here, the stem tensor is further segmented into sizes that fit a single device by trimming the subsequent $N_{intra}$ modes, thereby yielding $T_s^{device}(\alpha_{N_{inter}+N_{intra}},\ldots,\alpha_n)$.
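As a toy illustration of this mode-based slicing (sizes and the helper name are ours, not the production code):

```python
import torch

N_INTER, N_INTRA, RANK = 2, 1, 6   # 4 nodes x 2 GPUs; rank-6 stem tensor, all modes of dim 2
T = torch.randn(*([2] * RANK))     # full stem tensor, materialized here only for illustration

def device_shard(t: torch.Tensor, node_id: int, dev_id: int) -> torch.Tensor:
    """Slice owned by one GPU: the first N_INTER modes are fixed by the node
    index and the next N_INTRA modes by the device index; the remaining
    RANK - N_INTER - N_INTRA modes stay local."""
    inter = [(node_id >> k) & 1 for k in reversed(range(N_INTER))]
    intra = [(dev_id >> k) & 1 for k in reversed(range(N_INTRA))]
    return t[tuple(inter + intra)]

piece = device_shard(T, node_id=3, dev_id=1)   # shape (2, 2, 2)
```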

Before launching the final network, we introduce a pre-processing step involving a hybrid communication algorithm that determines the $N_{inter}$ and $N_{intra}$ modes according to the bandwidth of the different storage levels.

As shown in Fig. 4, in our classical computational setup, GPUs within a node are connected via NVLink, which has a bandwidth of 300 GB/s. Nodes are connected through InfiniBand, which in our case operates at a bandwidth of 100 GB/s. These IB links are shared by 8 GPUs. As a result, inter-node communication is one order of magnitude slower than intra-node communication.

Our approach divides the workload between intra-node and inter-node communication, taking advantage of the higher bandwidth provided by NVLink and thereby optimizing overall performance. The hybrid communication algorithm, which combines intra-node and inter-node communication at each step, is shown in Algorithm 1.

init: sub-network consisting of the initial tensors and the optimal contraction path whose stem is $\{ein[i]\}_{i\in\mathcal{I}}$
1  Determine the $N_{inter}$ modes and $N_{intra}$ modes according to the storage levels
2  for $i$ in $\mathcal{I}$ do
3      $currEin \leftarrow ein[i]$
4      if any of the first $N_{inter}$ modes are contracted then
5          Inter-node communication
6      else if any of the subsequent $N_{intra}$ modes are contracted then
7          Intra-node communication
8      end if
9      Perform the computation of $currEin$
10 end for
Algorithm 1: Hybrid communication algorithm

Fig. 4(b) illustrates an example of the rearrangement pattern in which both $N_{intra}$ and $N_{inter}$ equal 1. In this scenario, $a_0$ corresponds to the inter mode and $a_1$ to the intra mode. During a single tensor contraction, no inter-node communication is required if the first $N_{inter}$ modes remain uncontracted. If any of the following $N_{intra}$ modes are contracted, we rearrange the tensor by swapping those $N_{intra}$ modes with the following $N_{intra}$ modes and redistribute the tensor based on the new $N_{intra}$ modes. When some of the first $N_{inter}$ modes are contracted, we rearrange the tensor by swapping the first $N_{inter}$ modes with the next $N_{inter}$ modes and redistribute it accordingly based on the new $N_{inter}$ modes.
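A minimal sketch of the communication primitive behind Algorithm 1, using `torch.distributed` (the process groups and chunking scheme here are illustrative assumptions; the actual kernels fuse this exchange with the quantization of Section 3.2):

```python
import torch
import torch.distributed as dist

def exchange_modes(local_chunk: torch.Tensor, group, n_parts: int) -> torch.Tensor:
    """All-to-all permutation used when contracted leading modes must be
    swapped with uncontracted ones. `group` is either the intra-node (NVLink)
    or inter-node (InfiniBand) process group, chosen per Algorithm 1."""
    send = list(local_chunk.chunk(n_parts, dim=0))
    recv = [torch.empty_like(p) for p in send]
    dist.all_to_all(recv, send, group=group)   # requires an initialized process group
    return torch.cat(recv, dim=0)
```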

3.2. Customized Low-Precision Communication

When the first $N_{inter}$ modes are contracted, we must perform all-to-all communication between multiple nodes, as shown in Fig. 4, which is a dominant factor in both time and energy usage. For example, in a 4T tensor network, inter-node communication accounts for up to 60.0% of the time and around 35% of the energy consumption. Consequently, inter-node communication becomes the bottleneck. Here we apply low-precision quantization to alleviate its cost.

In low-precision quantization, using a single scale and zero offset for the whole tensor can result in substantial accuracy losses. Therefore, quantizers divide the tensor into multiple groups and apply quantization to each group, whose tensor is referred to as a group tensor. Following classic tensor quantization methods (Vectorquantization), we apply the following general formulation of the quantization operator $\mathcal{Q}$ to the $i$-th group of tensor $\bm{T}$:

(1)  $\mathcal{Q}\left([\bm{T}]_i\right)=[\bm{T}]_i^{\text{exp}}\times\text{scale}+\text{zero},$

where $\text{scale}=\frac{q_{max}-q_{min}}{\max([\bm{T}]_i)-\min([\bm{T}]_i)}$ is the scale factor, $q_{max}$ and $q_{min}$ are the bounds of the target quantization range, and $\max([\bm{T}]_i)$ and $\min([\bm{T}]_i)$ are the maximum and minimum values of the group tensor $[\bm{T}]_i$. The zero-point is $\text{zero}=\frac{q_{min}\max([\bm{T}]_i)-q_{max}\min([\bm{T}]_i)}{\max([\bm{T}]_i)-\min([\bm{T}]_i)}$, and exp is an optional parameter for applying an exponential non-linearity.

Based on the above equation, we provide three types of quantization: float2half, float2int8, and float2int4; Table 1 lists the optimal quantization parameters for each type. For half and int8 quantization, we compute a global scale and zero-point for the entire tensor. In contrast, int4 quantization requires more granular quantization within each group of the tensor rather than over the entire tensor, which significantly reduces fidelity loss (GDRQ). As different groups have different parameters, smaller group sizes yield better fidelity through tailored scales and zero-points, but incur more communication overhead. Additionally, rounding is applied when performing int quantization.

Table 1. Refined quantization parameters.
Type        Range                    Exp   Group          Round
float       $\pm 3.4\times10^{38}$   –     –              –
float2half  $\pm 6.55\times10^{4}$   1     entire tensor  false
float2int8  $-128 \sim 127$          0.2   entire tensor  true
float2int4  $0 \sim 15$              1     group tensor   true

We further enhance performance by crafting custom kernels for all the quantization types above. For complex-valued data, we optimize the kernels through vectorized memory access and fine-tuning to maximize bandwidth utilization and reduce latency. Finally, by adopting the int4 quantization kernel, communication time decreases by over 85% with minimal fidelity loss (compared with float), providing an advantage across a variety of cases.
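A plain-PyTorch sketch of Eq. (1) for the int8 case (per-tensor, with exp = 1; the int4 variant applies the same formula per group, and the production version is a fused CUDA kernel rather than this reference code):

```python
import torch

def quantize_int8(t: torch.Tensor):
    """Affine quantization per Eq. (1): a single scale/zero for the tensor."""
    qmin, qmax = -128.0, 127.0
    tmin, tmax = t.min(), t.max()
    scale = (qmax - qmin) / (tmax - tmin)
    zero = (qmin * tmax - qmax * tmin) / (tmax - tmin)
    q = torch.round(t * scale + zero).clamp(qmin, qmax).to(torch.int8)
    return q, scale, zero

def dequantize(q: torch.Tensor, scale, zero) -> torch.Tensor:
    return (q.float() - zero) / scale

x = torch.randn(1 << 10)            # e.g. one slice of a stem tensor
q, s, z = quantize_int8(x)          # 4x smaller payload on the InfiniBand link
err = (dequantize(q, s, z) - x).abs().max()
```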

3.3. Expanding Einsum Paradigm for Complex-Half Precision

Benefiting from the absence of uncontrollable noise, complex-half precision can be adopted in classical quantum simulation to minimize memory demand. However, complex-half einsum is not directly supported by existing libraries. We therefore delve into the einsum paradigm and extend it to support complex-half precision.

Assume that there are three tensors $A$, $B$, and $C$ whose ranks (referred to as modes in the context of an einsum equation) are $N_A$, $N_B$, and $N_C$, respectively, and let $N_{reduce}$ be the number of reduction indices. A general einsum equation

(2)  $\alpha_1\alpha_2\ldots\alpha_{N_A},\ \beta_1\beta_2\ldots\beta_{N_B}\rightarrow\gamma_1\gamma_2\ldots\gamma_{N_C},$

can be expressed as the following summation:

(3)  $C_{\gamma_1\gamma_2\ldots\gamma_{N_C}}=\sum_{\delta_1}\sum_{\delta_2}\ldots\sum_{\delta_{N_{reduce}}}A_{\alpha_1\alpha_2\ldots\alpha_{N_A}}B_{\beta_1\beta_2\ldots\beta_{N_B}},$

where $\delta_1,\delta_2,\ldots,\delta_{N_{reduce}}$ are the indices to be reduced. It is crucial to recognize that the formula can be transformed into a pure General Matrix Multiply (GEMM) operation, rather than a combination of GEMM and reduction, only when the following condition is met: $(\delta_1,\delta_2,\ldots,\delta_{N_{reduce}})=(\alpha_1,\alpha_2,\ldots,\alpha_{N_A})\cap(\beta_1,\beta_2,\ldots,\beta_{N_B})$. Moreover,

(4) γ1,,γNC=(α1,,αNA)(β1,,βNB)\(δ1,,δNreduce)subscript𝛾1subscript𝛾subscript𝑁𝐶subscript𝛼1subscript𝛼subscript𝑁𝐴\subscript𝛽1subscript𝛽subscript𝑁𝐵subscript𝛿1subscript𝛿subscript𝑁𝑟𝑒𝑑𝑢𝑐𝑒\gamma_{1},\ldots,\gamma_{N_{C}}=(\alpha_{1},\ldots,\alpha_{N_{A}})\cup(\beta_% {1},\ldots,\beta_{N_{B}})\backslash(\delta_{1},\ldots,\delta_{N_{reduce}})italic_γ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_γ start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT end_POSTSUBSCRIPT = ( italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_α start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ∪ ( italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_β start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) \ ( italic_δ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_δ start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_r italic_e italic_d italic_u italic_c italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT )

are the remaining indices of both tensors $A$ and $B$.

To optimize space complexity and enhance calculation speed, a straightforward approach is to swap single-precision floating-point numbers for half-precision ones. However, we encountered a lack of support for contracting complex half-precision numbers in high-performance computing libraries; this deficiency persists even in well-known libraries. Some libraries, such as PyTorch, support complex half-precision calculations by splitting tensors into real and imaginary parts, which is inefficient due to multiple reads/writes and the handling of discontinuous data. We introduce a new solution to these challenges.

To enhance readability, we generally denote $A$ as the larger input, $B$ as the smaller one, and $C$ as the output. In this case, $B$'s IO operations are often negligible compared to those of $A$ and $C$, so our primary focus is on data access for $A$ and $C$.

A straightforward attempt is to simply append a mode indicating whether an element represents the real or imaginary part at the end of each operand in Equation 2, as shown in Equation 5. However, this approach is incorrect: since $\alpha_{N_A+1}$ and $\beta_{N_B+1}$ are indices to be reduced while $\gamma_{N_C+1}$ is a remaining index, no index in either input generates $\gamma_{N_C+1}$, which necessitates a modification of Equation 5:

(5)  $\alpha_1\ldots\alpha_{N_A}\alpha_{N_A+1},\ \beta_1\ldots\beta_{N_B}\beta_{N_B+1}\rightarrow\gamma_1\ldots\gamma_{N_C}\gamma_{N_C+1}.$

Since introducing a supplementary mode doubles the size in bytes, the most efficient approach is to add the extra mode to tensor $B$, as it is smaller than tensor $A$. The final equation becomes:

(6)  $\alpha_1\ldots\alpha_{N_A}\alpha_{N_A+1},\ \gamma_{N_C+1}\beta_1\ldots\beta_{N_B}\alpha_{N_A+1}\rightarrow\gamma_1\ldots\gamma_{N_C}\gamma_{N_C+1},$

where tensor $B$ is padded from $[B_{(real,imag)}]$ to $[B_{(real,-imag)},B_{(imag,real)}]$. Here, we substitute $\beta_{N_B+1}$ with $\alpha_{N_A+1}$ because they are identical by the definition in Equation 4.

Here is an example with $a_1,b_1\rightarrow a_1b_1$, where tensor $A$ is $[(1+2i),(3+4i)]$ and tensor $B$ is $[(5+6i)]$, yielding the result $[(-7+16i),(-9+38i)]$. Following the method above, we transform the complex tensor $A$ into $[[1,2],[3,4]]$ and rearrange tensor $B$ into $[[5,-6],[6,5]]$. By applying the modified equation $a_1a_2,c_0b_1a_2\rightarrow a_1b_1c_0$, we obtain exactly the same result, $[[-7,16]],[[-9,38]]$.
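The worked example can be checked directly in PyTorch; the real-valued einsum below reproduces the complex result (float32 on CPU here; on GPU the same real equation runs in float16 on tensor cores):

```python
import torch

A = torch.tensor([1 + 2j, 3 + 4j])            # complex input, mode a1
B = torch.tensor([5 + 6j])                    # complex input, mode b1
ref = torch.einsum('a,b->ab', A, B)           # [(-7+16i)], [(-9+38i)]

A_r = torch.view_as_real(A)                   # (2, 2): modes a1, a2 (real/imag)
B_pad = torch.tensor([[[5., -6.]],            # (2, 1, 2): modes c0, b1, a2,
                      [[6.,  5.]]])           # i.e. [B_(real,-imag), B_(imag,real)]

# Modified equation a1 a2, c0 b1 a2 -> a1 b1 c0: purely real arithmetic.
C_r = torch.einsum('ab,cdb->adc', A_r, B_pad)
assert torch.allclose(torch.view_as_complex(C_r.contiguous()), ref)
```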

3.4. Optimization for special cases

We introduce several specialized techniques. They are not as general as the previously introduced methods, but they are useful for certain tasks.

3.4.1. Recomputation

In the 4T tensor network, we observe the following features: (1) there are only four steps whose intermediate tensors exceed 1T, one of which reaches 2T; (2) during and after these four steps, there is no data communication. Based on this, we designed a recomputation technique. The main idea is that instead of calculating the entire large tensors at once, we compute half of them at a time. Specifically, we begin at the start, just before the generation of the 1T tensor, store and compute only half of the tensors, and proceed until the end. Then, we restart from the middle, read the remaining half, and calculate the second half. Finally, we concatenate the two parts. This technique halves the number of required nodes and concurrently reduces the data volume of the all-to-all communication, since $N_{inter}$ is reduced by 1.
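A toy sketch of the halving idea (the einsum stands in for the communication-free steps above 1T; shapes and names are illustrative, not the actual contraction path):

```python
import torch

def tail_steps(stem_half: torch.Tensor, rest: torch.Tensor) -> torch.Tensor:
    # Stand-in for the four large, communication-free contraction steps.
    return torch.einsum('ab,bc->ac', stem_half, rest)

stem = torch.randn(8, 4)   # stands in for the tensor that would not fit in memory
rest = torch.randn(4, 4)

# First pass processes one half; the second pass recomputes from the midpoint
# for the other half; the two parts are then concatenated.
out = torch.cat([tail_steps(stem[:4], rest), tail_steps(stem[4:], rest)], dim=0)
assert torch.allclose(out, tail_steps(stem, rest))
```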

3.4.2. Tensor contraction during sparse state

As described in (512GPUs_15h), the sparse state occurs in the final stage of tensor network computation. It is inherently discontinuous and repetitive, requiring tensors to be copied for computation. However, because a double-buffer is allocated, GPU memory is nearly exhausted and cannot support loading large tensor calculations; we therefore divide the larger tensor into smaller chunks that fit into the current GPU memory, with the number of chunks determined by the remaining GPU memory capacity. We then compute each tensor chunk iteratively.

Refer to caption
Figure 5. The bottom part shows tensor multiplication that traditionally retrieves tensors through indices. The top part shows multiplication between the source tensor and a padding tensor retrieved through a 2-dimensional index. $A_I$ and $B_I$ are multiplication input tensors indexed from source tensors $A$ and $B$. $B_P$ is a padding tensor. $C$ and $C_P$ are the respective products of the two multiplications.

Tensor contraction involves dimension reordering and matrix multiplication (Sunway_304s). With the enhancements of the NVIDIA A100, we are no longer limited to contracting a single pair of tensors: by utilizing cuTensor library functions, we can perform computations on multiple pairs of tensors simultaneously. As shown in Fig. 5, we gather tensors at the positions specified by the indices $Index_A$ and $Index_B$ to form the large tensors $A_I$ and $B_I$ before performing matrix multiplication. We have designed two indexing schemes to accommodate different index distribution scenarios.

Let A[a,c,d,f] and B[b,e,f] be the input tensors and C[n,c,d,e] the contraction product, where every rank has extent two except the outermost ranks a, b, and n, and f is a common rank to be contracted. Let $m_a$ and $m_b$ be the chunk sizes of A and B, respectively, and $m_n$ that of C. Usually, $Index_A$ and $Index_B$ are arrays of $m_n$ integers ranging from 0 to $m_a$ or $m_b$, and $C = A_I \times B_I$. However, when the index of a high-rank input tensor contains many repeated values, e.g. $Index_A = [0,0,1,1,1,3,4,...,m_a]$, gathering $A_I$ explicitly becomes very expensive. In that case, we use the input tensor A directly. In $Index_B$, $B_i \leq m_b$ $(i=0,1,...,n)$; to retrieve $B_I$ correctly, we pad $Index_B$ into a new 2-dimensional index of size $m_a * m_r$, where $m_r$ is the maximum repeat count of any value in $Index_A$, and fill the excess positions with -1. In this example, $m_r$ is 3, since 1 appears 3 times. To avoid ambiguity, we call this new $B_I$ the padding-tensor $B_P$; because B is smaller than A, this step does not add much cost. Finally, we compute $C_P = A \times B_P$, and C is derived from $C_P$ by flattening its outer ranks a and c and extracting the valid tensors.
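
A sketch of this second scheme with rank-2 toy tensors of ours, assuming $Index_A$ is sorted; the padded 2-D index is built on the host for clarity:

```python
import torch

index_a = torch.tensor([0, 0, 1, 1, 1, 3])     # 1 repeats 3 times -> m_r = 3
index_b = torch.tensor([5, 2, 7, 1, 0, 4])
m_a, m_b = 4, 8

m_r = int(torch.bincount(index_a, minlength=m_a).max())
pad = torch.full((m_a, m_r), -1, dtype=torch.long)
slot = torch.zeros(m_a, dtype=torch.long)
for a, b in zip(index_a.tolist(), index_b.tolist()):
    pad[a, slot[a]] = b                         # fill each row of the 2-D index
    slot[a] += 1

A = torch.randn(m_a, 2)                         # contracted rank f = 2
B = torch.randn(m_b, 2)
B_P = B[pad.clamp(min=0)]                       # (m_a, m_r, f); -1 slots are dummies
C_P = torch.einsum('af,arf->ar', A, B_P)        # C_P = A x B_P in one batched call

valid = pad.flatten() >= 0                      # flatten outer ranks, drop dummies
C = C_P.flatten()[valid]                        # equals the pairwise A[ia] . B[ib]
```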

4. Evaluation

We begin by presenting our experimental setup. Following that, we discuss the tools utilized in our performance analysis. Next, we evaluate low-precision communication and provide usage advice for communication in RQC simulation. We proceed to assess the proposed techniques step by step and identify the optimal solution for a single multi-node level subtask. Finally, we present and analyze the results of testing two large-scale tensor networks (specifically, 4T and 32T).

4.1. Experiment setup

We employ 80 GB A100 GPUs with a peak FP16 Tensor Core performance of 312 TFLOPS. The eight A100 GPUs within each node are interconnected via NVLink with a unidirectional bandwidth of 300 GB/s, while nodes are connected via InfiniBand with a unidirectional bandwidth of 100 GB/s. The task is implemented on the PyTorch framework (version 2.1); einsum computation is performed with cuTensor (version 1.7.0) and CUDA Toolkit 12.0.

4.2. Measurement tools

To monitor GPU power consumption in real time, we create a subprocess that continuously invokes the NVML library (nvml, ) through its Python interface. This subprocess runs at the device level and receives its machine rank as a parameter, so that it can capture and store both the relative timestamp and the instantaneous power of each graphics card at roughly 20-millisecond intervals. Since it runs in a separate process, it has almost no impact on the performance of the main process.
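
A minimal sketch of such a monitor using the pynvml bindings; the log file name and rank handling are illustrative assumptions of ours, while the 20 ms interval follows the text:

```python
import time
from multiprocessing import Process

import pynvml

def monitor(machine_rank: int, interval_s: float = 0.02):
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    t0 = time.time()
    with open(f"power_rank{machine_rank}.log", "w") as f:
        while True:
            ts = time.time() - t0                       # relative timestamp
            for i, h in enumerate(handles):
                mw = pynvml.nvmlDeviceGetPowerUsage(h)  # instantaneous power, mW
                f.write(f"{ts:.3f} {i} {mw / 1000:.1f}\n")
            time.sleep(interval_s)

if __name__ == "__main__":
    p = Process(target=monitor, args=(0,), daemon=True)
    p.start()   # runs alongside the main task with negligible overhead
```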

At the end of the tasks, we integrate the instantaneous power over time to obtain each device's total energy consumption and sum the results at the global level, which reflects the consumption of the entire run. In Table 2, we present the measured power under three distinct scenarios.

Table 2. Measured Power per A100 GPU

| State | Power per A100 GPU |
| --- | --- |
| Idle | 60 W |
| Communication | 90–135 W |
| Computation | 220–450 W |
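
The integration step described above can be sketched as follows (ours; the log layout follows the monitor sketch, and the final sum across nodes is omitted):

```python
import numpy as np

# Trapezoidal integration of each card's power trace, summed into watt-hours.
ts, gpu, watts = np.loadtxt("power_rank0.log", unpack=True)
energy_wh = sum(np.trapz(watts[gpu == g], ts[gpu == g])
                for g in np.unique(gpu)) / 3600.0
```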

Based on the quantization method designed in Subsection 3.2, we define a quantization compression rate (CR, %), which denotes the ratio of the data volume actually communicated after compression to the original data volume,

(7) $\text{CR}(\%) = \dfrac{\text{sizeof}(T_{\text{scales}}) + \text{sizeof}(T_{\text{zeros}}) + \text{sizeof}(T_{\text{quant}})}{\text{sizeof}(T_{\text{origin}})},$

where $T$ represents the tensor.
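
For concreteness, here is a minimal sketch of group-wise quantization to int4 and the resulting CR; the group size of 128 matches the scheme adopted later, while the packing layout and function are illustrative assumptions of ours:

```python
import torch

def quantize_int4(t: torch.Tensor, group: int = 128):
    g = t.reshape(-1, group).float()               # one scale/zero-point per group
    t_min = g.min(dim=1).values
    t_max = g.max(dim=1).values
    scales = ((t_max - t_min) / 15.0).clamp(min=1e-8)   # 4-bit integer range 0..15
    zeros = t_min
    q = ((g - zeros[:, None]) / scales[:, None]).round().clamp(0, 15).to(torch.uint8)
    packed = q[:, 0::2] | (q[:, 1::2] << 4)        # pack two int4 values per byte
    return scales.half(), zeros.half(), packed

t = torch.randn(1 << 20).half()
scales, zeros, packed = quantize_int4(t)
cr = (scales.numel() * 2 + zeros.numel() * 2 + packed.numel()) / (t.numel() * 2)
print(f"CR = {cr:.2%}")   # ~26.6%: the int4 payload plus per-group overhead
```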

The results are compared to the benchmarks using the similarity function presented below, with similarity denoted as fidelity:

(8) $fidelity = \left| \dfrac{\sqrt{\langle benchmark, result \rangle^{2}}}{\lVert benchmark \rVert \, \lVert result \rVert} \right|^{2}.$
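
For complex amplitude vectors, Equation (8) reduces to the squared normalized inner product; a one-line sketch of ours using PyTorch (torch.vdot conjugates its first argument):

```python
import torch

def fidelity(benchmark: torch.Tensor, result: torch.Tensor) -> float:
    inner = torch.vdot(benchmark.flatten(), result.flatten()).abs()
    return float((inner / (benchmark.norm() * result.norm())) ** 2)
```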

4.3. Assessment for low-precision communication

In this section, we explore the most effective method of data quantization for communication with the dual goals of minimizing energy consumption and achieving high-fidelity results. First, we analyze precision sensitivity within the stem path of the 4T tensor network. We then evaluate the impact of quantization for both intra-node and inter-node communication, and identify the quantization strategy for tackling the fidelity-energy trade-off.

4.3.1. Single step of quantization

We conducted stepwise experiments in which quantization was applied at a single contraction step at a time (see Fig. 6).

Figure 6. Quantization compression rate (see Equation 7) and relative fidelity under single-step quantization. The curve shows the relative fidelity, calculated as the quantized fidelity divided by the complex64 fidelity; the percentage numbers give the CR (%).

Relative fidelity is less stable and lower when quantization is applied in the early stages of the task, because quantizing early accumulates more error than quantizing late. Furthermore, the relative fidelity is independent of the amount of communicated data. We therefore quantize in the later stages of the task and select as large an amount of data for quantization as possible to maximize the benefit. The dashed lines in the figure mark the quantization steps we ultimately chose.

4.3.2. Assessment for quantization of intra-node communication

The time of all-to-all communication is

(9) $T_{all2all} = \dfrac{Data\ Amount}{bandwidth} \cdot \dfrac{N}{N-1} \cdot \dfrac{1}{r},$

where $N$ represents the number of devices within a node (8 in this case), and $r$ denotes the bandwidth utilization rate of all-to-all communication, which is approximately 50% in practice. The energy consumption is proportional to:

(10) $energy \propto \alpha \cdot T_{all2all} + \beta \cdot T_{calculation},$

where $\alpha$ and $\beta$ are power-consumption coefficients. Empirical data show that the ratio $\alpha/\beta$ is approximately $1/3$. For every 1 GB of data, the quantization kernel takes 4.25 ms, whereas the reduction in communication time achieved by quantization, per Equation 9, is a mere 4.78 ms. The time saved through quantization is therefore insignificant.

Further analysis with Equation 10 shows that overall energy consumption increases, as expressed by the inequality $-\alpha \cdot 4.78 + \beta \cdot 4.25 > 0$, resulting in a decline in performance. Combined with the fidelity reduction described in Section 4.3.1, this conclusively demonstrates that quantization of intra-node communication yields negative outcomes. It is therefore not advantageous and is not adopted in the final task.
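
Substituting the empirical ratio $\alpha/\beta \approx 1/3$ gives a quick check of this inequality:

$-\alpha \cdot 4.78 + \beta \cdot 4.25 \approx \beta \left( 4.25 - \dfrac{4.78}{3} \right) \approx 2.66\,\beta > 0,$

so, per gigabyte, the quantization kernel costs more energy than the shortened intra-node communication saves.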

4.3.3. Assessment for quantization of inter-node communication

Despite the negative effect of quantization on intra-node communication, Equations 9 and 10 suggest that quantization can benefit inter-node communication. To demonstrate this concretely, we conducted an end-to-end subtask of a 4T TN; the result is shown in Fig. 7.

We observe that the total time and energy consumption gradually decrease from float precision to int4 (128), but remain relatively constant beyond int4 (128). The performance improvement from float precision to int4 (128) outweighs the decrease in relative fidelity, whereas the improvement beyond int4 (256) is smaller than the corresponding fidelity loss. In summary, int4 with a group size of 128 yields the maximum positive impact for inter-node quantization. We adopt this as the final quantization scheme for our task: relative fidelity drops by 6.55%, while time decreases by 50.08% and energy consumption by 30.23%.

Figure 7. Time, energy, and relative fidelity after inter-node quantization on the 4T tensor network. The orange curve shows relative fidelity, the green curve shows energy consumption, and the bars show calculation and communication time. The dashed line marks the scheme we ultimately adopted, int4 (128).
Table 3. Assessment of the impact of the proposed methods on the overall performance of the 4T tensor network without post-processing.

| Computation dtype | Communication dtype | Hybrid communication | Inter (GB)/GPU | Intra (GB)/GPU | Other optimizations | Nodes | Energy (Wh) | Fidelity (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| float | float | No | 36 | – | No | 8 | 19.78 | 1 |
| float | half | No | 18 | – | No | 8 | 16.48 | 99.999 |
| half | half | No | 36 | – | No | 4 | 13.03 | 99.995 |
| half | half | Yes | 28 | 20 | No | 4 | 12.67 | 99.995 |
| half | half | Yes | 24 | 40 | Yes | 2 | 10.57 | 99.965 |
| half | int8 | Yes | 12 | 40 | Yes | 2 | 10.12 | 99.912 |
| half | int4 (128) | Yes | 6 | 40 | Yes | 2 | 9.89 | 98.007 |

4.4. Assessment of the proposed techniques

In this section, we aim to demonstrate how the proposed innovations improve performance incrementally. As shown in Table 3, each row represents a single subtask. We start from the baseline case, where no optimizations or quantizations are applied, using single-precision floating-point data types and setting the result as a benchmark.

Initially, we decrease the quantity of communicated data by converting single precision to half precision for communication, leading to a 16.68% reduction in energy consumption with negligible loss of fidelity. We then adopt half precision for computation as well, reducing the minimum number of nodes needed for a sub-network from 8 to 4; this change yields a further 20.93% decrease in energy usage while maintaining a negligible loss of accuracy.

Next, we offload part of the inter-node communication to intra-node communication, as certain tensor network contraction steps do not require data to be permuted across all nodes. Redirecting some of the communication burden intra-node brings an additional 2.76% reduction in energy consumption with minimal fidelity degradation. Subsequently, by exploiting the features of the 4T network, we apply the recomputation algorithm, which reduces the minimum number of nodes required for a sub-network from 4 to 2 and achieves a 16.57% energy reduction with no significant loss of fidelity.

Our final experiments employ int8 and int4 quantization for communication, achieving energy reductions of 4.25% and 6.43%, respectively, with fidelity losses within 2%. In conclusion, the proposed innovations progressively improve performance, enabling substantial energy savings while maintaining high fidelity in tensor contraction operations.

4.5. Verification and simulation of Sycamore circuits

Table 4. The metrics and results of the simulated Sycamore experiment. Bold numbers indicate that the result is better than Sycamore.

| Metric | 4T, no post-processing | 4T, post-processing | 32T, no post-processing | 32T, post-processing |
| --- | --- | --- | --- | --- |
| Time complexity (FLOP) | $4.7\times10^{17}$ | $7.9\times10^{16}$ | $1.3\times10^{17}$ | $1.6\times10^{16}$ |
| Memory complexity (elements) | $3.1\times10^{15}$ | $6.4\times10^{14}$ | $1.3\times10^{15}$ | $1.6\times10^{14}$ |
| XEB value (%) | 0.2036 | 0.2059 | 0.21194 | 0.2158 |
| Efficiency (%) | 21.09 | 18.14 | 16.65 | 17.09 |
| Total number of subtasks | $2^{18}$ | $2^{18}$ | $2^{12}$ | $2^{12}$ |
| Number of subtasks conducted | 528 | 84 | 9 | 1 |
| Nodes per subtask | 2 | 2 | 32 | 32 |
| Memory per multi-node level (TB) | 1.25 | 1.25 | 20 | 20 |
| Compute resource (A100 GPUs) | 2112 | 96 | 2304 | 256 |
| Time-to-solution (s) | **32.51** | **133.15** | **14.22** | **17.18** |
| Energy consumption (kWh) | 5.77 | **1.12** | **2.39** | **0.29** |

In order to sample from the Sycamore circuit, we applied our techniques to two large-scale tensor networks, 4T and 32T, with and without post-selection (also known as post-processing) (leapfrogging, ), aiming for $k = 3\cdot10^{6}$ uncorrelated samples. Our experiment attained a peak half-precision performance of 561 PFLOPS, with approximately 20% efficiency achievable through the proposed techniques. A summary of our verification details is presented in Table 4, where bold numbers indicate metrics that outperform Sycamore. All test cases surpass Sycamore in time-to-solution, and three cases surpass it in energy consumption. Among these, our best case (32T with post-processing) is an order of magnitude lower than Sycamore in both time and energy.

We compared the four cases of our simulation for the task of generating bitstring samples with a bounded fidelity of 0.002. Our analysis focused on three key aspects: the influence of post-selection, the size of the tensor network, and the scalability of the approach.

4.5.1. Assessment of Post-processing

As previously mentioned, the extensive tensor network can be divided into self-contained sub-networks, each of which can be contracted at a single multi-node level. In our simulation, with the post-selection approach it is only necessary to perform approximately 11.1%–15.9% of the tasks that would otherwise be required to achieve an XEB value of 0.002. Comparing the methods with and without post-selection, we conclude that post-selection significantly reduces both time and space complexity.
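
These fractions follow directly from the subtask counts in Table 4:

$\dfrac{84}{528} \approx 15.9\% \ \text{(4T)}, \qquad \dfrac{1}{9} \approx 11.1\% \ \text{(32T)}.$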

4.5.2. Assessment of size of tensor network

The space complexity of the algorithm is proportional to $s \cdot 2^{M}$, where $s$ represents the size of the data type and $M$ is the targeted contraction treewidth, which can be specified manually in the slicing algorithm. Consequently, the space complexity is fully controllable. Comparing tensor networks of different sizes, it is evident that the time and space complexity decrease as the size of the tensor network increases (at the global level, not for a single multi-node level).
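
As a purely illustrative calculation (the parameters here are ours, not a configuration from the experiments): with $s = 4$ bytes per complex-half element and a targeted treewidth of $M = 40$, the largest sliced tensor would occupy

$s \cdot 2^{M} = 4\,\text{B} \times 2^{40} = 4\,\text{TB}.$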

Figure 8. The influence of global memory usage on (a) time-to-solution and (b) energy consumption, for simulating a quantum RCS experiment with XEB = 0.002 using different numbers of GPUs. In the 32T tensor network with post-processing, achieving XEB = 0.002 requires only a single multi-node task; there is thus no need to parallelize subtasks, resulting in a single point (blue circle in (a), blue triangle in (b)) without fitting lines.

4.5.3. Scalability Exploration

In Fig. 8, we observe a linear decay in the total time-to-solution as the number of GPUs increases, owing to the parallel-friendly nature of the slicing algorithm and the three-level parallel scheme. Specifically, our method scales strongly in time from 128 GPUs (16 nodes) to 768 GPUs (96 nodes) for the 4T tensor network with post-selection, from 271 GPUs (34 nodes) to 2112 GPUs (264 nodes) for the 4T tensor network without post-selection, and from 256 GPUs (32 nodes) to 2304 GPUs (288 nodes) for the 32T tensor network without post-selection. Moreover, the energy consumption remains essentially constant as nodes are added.

5. Conclusion

Previous studies have predominantly focused on identifying optimal paths for classical devices to efficiently compute tensor networks and store tensors within time and memory constraints, from an algorithmic perspective. In contrast, our work aims to enhance the capabilities of multi-node computational clusters and reduce computational complexity. This is achieved through the implementation of a three-level parallel scheme coupled with a carefully crafted inter-node and intra-node hybrid communication approach.

To significantly mitigate overhead costs associated with large tensors, we propose a low-precision communication approach and an einsum extension for complex-half data. Additionally, we introduce several specialized techniques to expedite specific tasks.

In our best case without post-processing, we reduced the time-to-solution to 14.22 seconds with an energy consumption of 2.39 kWh at a fidelity of 0.002. With the post-processing technique, we achieve our best performance, remarkably reducing the time-to-solution for sampling $3\times10^{6}$ bitstrings to just 17.18 seconds with an energy consumption of 0.29 kWh. This represents a one-order-of-magnitude reduction in both time and energy compared with Google's quantum processor Sycamore.

In this work, by leveraging state-of-the-art algorithms and a highly optimized, scalable tensor network computation system based on GPU clusters, we have significantly surpassed Sycamore in both computing speed and energy consumption, experimentally challenging Google's initial claim of quantum supremacy. Our work resets the boundaries of classical computing capabilities, and we hope it helps the community evaluate claims of quantum advantage more robustly in future experiments.

In future work, our techniques for supporting large-scale tensor networks can be extended beyond RQC sampling simulation. They can be applied directly to diverse fields such as quantum computing simulators (guerreschi2020intel, ), condensed matter physics (liu2021tropical, ), and combinatorial optimization (liu2023computing, ). Such extensions aim to improve the scalability of problem models and facilitate the solution of realistic challenges such as satisfiability, set packing, and clique problems.

Acknowledgements.
This work was supported by the National Key R&D Program of China (No. 2022ZD0160201, No. 2022ZD0160101).

References

[1] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
[2] Han-Sen Zhong, Hui Wang, Yu-Hao Deng, Ming-Cheng Chen, Li-Chao Peng, Yi-Han Luo, Jian Qin, Dian Wu, Xing Ding, Yi Hu, et al. Quantum computational advantage using photons. Science, 370(6523):1460–1463, 2020.
[3] Han-Sen Zhong, Yu-Hao Deng, Jian Qin, Hui Wang, Ming-Cheng Chen, Li-Chao Peng, Yi-Han Luo, Dian Wu, Si-Qiu Gong, Hao Su, et al. Phase-programmable gaussian boson sampling using stimulated squeezed light. Physical Review Letters, 127(18):180502, 2021.
[4] Yu-Hao Deng, Yi-Chao Gu, Hua-Liang Liu, Si-Qiu Gong, Hao Su, Zhi-Jiong Zhang, Hao-Yang Tang, Meng-Hao Jia, Jia-Min Xu, Ming-Cheng Chen, et al. Gaussian boson sampling with pseudo-photon-number-resolving detectors and quantum computational advantage. Physical Review Letters, 131(15):150601, 2023.
[5] Yulin Wu, Wan-Su Bao, Sirui Cao, Fusheng Chen, Ming-Cheng Chen, Xiawei Chen, Tung-Hsun Chung, Hui Deng, Yajie Du, Daojin Fan, et al. Strong quantum computational advantage using a superconducting quantum processor. Physical Review Letters, 127(18):180501, 2021.
[6] Qingling Zhu, Sirui Cao, Fusheng Chen, Ming-Cheng Chen, Xiawei Chen, Tung-Hsun Chung, Hui Deng, Yajie Du, Daojin Fan, Ming Gong, et al. Quantum computational advantage via 60-qubit 24-cycle random circuit sampling. Science Bulletin, 67(3):240–245, 2022.
[7] Koen Bertels, Aritra Sarkar, and Imran Ashraf. Quantum computing—from NISQ to PISQ. IEEE Micro, 41(5):24–32, 2021.
[8] Alexander Zlokapa, Benjamin Villalonga, Sergio Boixo, and Daniel A Lidar. Boundaries of quantum supremacy via random circuit sampling. npj Quantum Information, 9(1):36, 2023.
[9] Benjamin Villalonga, Dmitry Lyakh, Sergio Boixo, Hartmut Neven, Travis S Humble, Rupak Biswas, Eleanor G Rieffel, Alan Ho, and Salvatore Mandrà. Establishing the quantum supremacy frontier with a 281 PFLOP/s simulation. Quantum Science and Technology, 5(3):034003, 2020.
[10] Thomas Häner and Damian S Steiger. 0.5 petabyte simulation of a 45-qubit quantum circuit. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–10, 2017.
[11] Xin-Chuan Wu, Sheng Di, Emma Maitreyee Dasgupta, Franck Cappello, Hal Finkel, Yuri Alexeev, and Frederic T Chong. Full-state quantum circuit simulation by using data compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–24, 2019.
[12] Cupjin Huang, Fang Zhang, Michael Newman, Junjie Cai, Xun Gao, Zhengxiong Tian, Junyin Wu, Haihong Xu, Huanjun Yu, Bo Yuan, et al. Classical simulation of quantum supremacy circuits. arXiv preprint arXiv:2005.06787, 2020.
[13] Yong Liu, Xin Liu, Fang Li, Haohuan Fu, Yuling Yang, Jiawei Song, Pengpeng Zhao, Zhen Wang, Dajia Peng, Huarong Chen, et al. Closing the "quantum supremacy" gap: achieving real-time simulation of a random quantum circuit using a new Sunway supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2021.
[14] Yong Liu, Yaojian Chen, Chu Guo, Jiawei Song, Xinmin Shi, Lin Gan, Wenzhao Wu, Wei Wu, Haohuan Fu, Xin Liu, et al. Validating quantum-supremacy experiments with exact and fast tensor network contraction. arXiv preprint arXiv:2212.04749, 2022.
[15] Chu Guo, Yong Liu, Min Xiong, Shichuan Xue, Xiang Fu, Anqi Huang, Xiaogang Qiang, Ping Xu, Junhua Liu, Shenggen Zheng, et al. General-purpose quantum circuit simulator with projected entangled-pair states and the quantum supremacy frontier. Physical Review Letters, 123(19):190501, 2019.
[16] Riling Li, Bujiao Wu, Mingsheng Ying, Xiaoming Sun, and Guangwen Yang. Quantum supremacy circuit simulation on Sunway TaihuLight. IEEE Transactions on Parallel and Distributed Systems, 31(4):805–816, 2019.
[17] Feng Pan, Keyang Chen, and Pan Zhang. Solving the sampling problem of the Sycamore quantum circuits. Phys. Rev. Lett., 129:090502, Aug 2022.
[18] Feng Pan and Pan Zhang. Simulation of quantum circuits using the big-batch tensor network method. Phys. Rev. Lett., 128:030501, Jan 2022.
[19] Xian-He Zhao, Han-Sen Zhong, Feng Pan, Zi-Han Chen, Rong Fu, Zhongling Su, Xiaotong Xie, Chaoxing Zhao, Pan Zhang, Wanli Ouyang, et al. Leapfrogging Sycamore: Harnessing 1432 GPUs for 7× faster quantum random circuit sampling. arXiv preprint arXiv:2406.18889, 2024.
[20] Xun Gao, Marcin Kalinowski, Chi-Ning Chou, Mikhail D. Lukin, Boaz Barak, and Soonwon Choi. Limitations of linear cross-entropy as a measure for quantum advantage. PRX Quantum, 5:010334, Feb 2024.
[21] Yaojian Chen, Yong Liu, Xinmin Shi, Jiawei Song, Xin Liu, Lin Gan, Chu Guo, Haohuan Fu, Jie Gao, Dexun Chen, and Guangwen Yang. Lifetime-based optimization for simulating quantum circuits on a new Sunway supercomputer. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP '23, pages 148–159, New York, NY, USA, 2023. Association for Computing Machinery.
[22] Sebastian Brandhofer, Ilia Polian, and Kevin Krsulich. Optimal partitioning of quantum circuits using gate cuts and wire cuts. IEEE Transactions on Quantum Engineering, 2023.
[23] Feng Pan, Hanfeng Gu, Lvlin Kuang, Bing Liu, and Pan Zhang. Efficient quantum circuit simulation by tensor network methods on modern GPUs. arXiv preprint arXiv:2310.03978, 2023.
[24] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch, Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel Michielsen. Massively parallel quantum computer simulator, eleven years later. Computer Physics Communications, 237:47–61, 2019.
[25] Guifré Vidal. Efficient classical simulation of slightly entangled quantum computations. Physical Review Letters, 91(14):147902, 2003.
[26] Koen De Raedt, Kristel Michielsen, Hans De Raedt, Binh Trieu, Guido Arnold, Marcus Richter, Th Lippert, Hiroshi Watanabe, and Nobuyasu Ito. Massively parallel quantum computer simulator. Computer Physics Communications, 176(2):121–136, 2007.
[27] Benjamin Villalonga, Sergio Boixo, Bron Nelson, Christopher Henze, Eleanor Rieffel, Rupak Biswas, and Salvatore Mandrà. A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware. npj Quantum Information, 5(1):86, 2019.
[28] Jianxin Chen, Fang Zhang, Cupjin Huang, Michael Newman, and Yaoyun Shi. Classical simulation of intermediate-size quantum circuits. arXiv preprint arXiv:1805.01450, 2018.
[29] Johnnie Gray and Stefanos Kourtis. Hyper-optimized tensor network contraction. Quantum, 5:410, 2021.
[30] Xin Liu, Chu Guo, Yong Liu, Yuling Yang, Jiawei Song, Jie Gao, Zhen Wang, Wenzhao Wu, Dajia Peng, Pengpeng Zhao, et al. Redefining the quantum supremacy baseline with a new generation Sunway supercomputer. arXiv preprint arXiv:2111.01066, 2021.
[31] Ze-bin Wu and Jun-qing Yu. Vector quantization: a review. Frontiers of Information Technology & Electronic Engineering, 20(4):507–524, 2019.
[32] Haibao Yu, Tuopu Wen, Guangliang Cheng, Jiankai Sun, Qi Han, and Jianping Shi. GDRQ: Group-based distribution reshaping for quantization. arXiv preprint arXiv:1908.01477, 2019.
[33] NVIDIA Management Library (NVML). https://developer.nvidia.com/nvidia-management-library-nvml. Accessed: 2024-03-27.
[34] Gian Giacomo Guerreschi, Justin Hogaboam, Fabio Baruffa, and Nicolas PD Sawaya. Intel quantum simulator: A cloud-ready high-performance simulator of quantum circuits. Quantum Science and Technology, 5(3):034007, 2020.
[35] Jin-Guo Liu, Lei Wang, and Pan Zhang. Tropical tensor network for ground states of spin glasses. Physical Review Letters, 126(9):090506, 2021.
[36] Jin-Guo Liu, Xun Gao, Madelyn Cain, Mikhail D Lukin, and Sheng-Tao Wang. Computing solution space properties of combinatorial optimization problems via generic tensor networks. SIAM Journal on Scientific Computing, 45(3):A1239–A1270, 2023.