A Hybrid Parallel Strategy for Isogeometric Topology Optimization via CPU/GPU Heterogeneous Computing

This paper aims to solve large-scale and complex isogeometric topology optimization problems that consume significant computational resources. A novel isogeometric topology optimization method with a hybrid CPU/GPU parallel strategy is proposed, and the hybrid parallel strategies for stiffness matrix assembly, equation solving, sensitivity analysis, and design variable update are discussed in detail. To ensure the high efficiency of CPU/GPU computing, a workload balancing strategy is presented for optimally distributing the workload between CPU and GPU. To illustrate the advantages of the proposed method, three benchmark examples are tested to verify the hybrid parallel strategy. The results show that the hybrid method is faster than both the serial CPU and the parallel GPU implementations, with speedups of up to two orders of magnitude.


Introduction
Developing advanced manufacturing techniques [1,2] puts forward new requirements for design tools. Among design approaches, topology optimization (TO) is considered one of the most promising for generating product prototypes during the conceptual design stage. Over the past few decades, TO has improved significantly [3] and has been applied to various complex problems such as fluid-structure interaction [4] and thermo-elastic behavior [5]. Bendsøe et al. [6] proposed a homogenization method, laying a foundation for the development of TO methods. According to the model expression, TO methods are roughly divided into two categories. One is geometric boundary representation-based methods [7][8][9]. The other is material representation-based methods [10][11][12], in which the structural topology is defined by a 0-1 distribution of material and evolved by making material trade-offs. Among them, solid isotropic material with penalization (SIMP) is the most classic method based on variable-density theory, with the advantages of simple implementation and stable solutions. SIMP is widely applied in various fields, including multiscale and multi-material design [13]. Doan et al. [14] presented a new computational design optimization method that finds the optimal multi-material design by considering structural strain energy and material cost. In most TO methods, the finite element method (FEM) is employed to analyze the displacement field and sensitivities. However, due to the disconnection between the geometric model and the analysis model [15], errors arise in the calculation. Moreover, the continuity of the Lagrange basis functions between adjacent elements is low, reducing the analysis accuracy [16].
To improve the accuracy of optimization, isogeometric analysis (IGA) was introduced [17][18][19], using unified Non-Uniform Rational B-Splines (NURBS) basis functions for both the geometric and computational models. With the merits of high accuracy and efficiency, IGA-based TO has been intensively studied [20]. Dedè et al. [21] utilized a phase-field model for the formulation and solution, and preserved the exactness of the design domain in TO through the IGA-based spatial approximation. In the optimization of lattice structures, IGA is used to analyze the effective properties of either isotropic or anisotropic cellular microstructures [22][23][24]. However, the computational cost of TO is expensive for complex large-scale models, since the number and order of elements must be large enough for high accuracy. Especially for IGA-based TO, analysis with high-order NURBS elements leads to a further rise in computational complexity and memory usage [24,25]. Furthermore, TO is an iterative computing process, and the computational cost rises significantly with increasing scale and complexity.
Parallel computing technology has therefore been investigated to accelerate the TO process. In earlier work, Kim et al. [26] made use of parallel topology optimization to solve large-scale eigenvalue-related structural design problems. Subsequently, Vemaganti et al. [27] presented a parallel algorithm for 2D structural topology optimization based on the solid isotropic material with penalization (SIMP) method and the optimality criteria (OC). Aage et al. [28] showed how to use PETSc for parallel computing and successfully applied it to solving large-scale topology optimization problems in parallel. A minimum weight formulation with parallelization techniques was used to accelerate the solution of the topology optimization problem in [29]. Since graphics processing units (GPUs) have an architecture that supports the large number of threads required for parallel computing, they can be applied for high-performance solutions to large-scale complex scientific problems [30,31]. Wadbro et al. [32] first exploited the parallel computing capabilities and programmability of GPUs to accelerate topology optimization methods. Schmidt et al. [33] used a GPU to accelerate the SIMP method, and experimental results demonstrated that the parallel algorithm on a GeForce GTX 280 runs faster than a 48-core shared-memory central processing unit (CPU) system, with a speed-up ratio of up to 60. Ratnakar et al. [34] presented an implementation of topology optimization on the GPU for 3D unstructured meshes by developing efficient and optimized GPU kernel functions. Karatarakis et al. [35] proposed the interaction-wise approach for the parallel assembly of the stiffness matrix in IGA, which enables the efficient use of GPUs to substantially accelerate the computation. There are few research papers focusing on parallel strategies for isogeometric topology optimization (ITO). Xia et al. [25] proposed a GPU parallel strategy for level set-based ITO and obtained a speedup of two orders of magnitude. Wu et al. [36] used an efficient geometric multigrid solver and GPU parallelization in the FEM analysis stage to accelerate topology optimization iterations on a desktop.
However, the above-mentioned studies focus on the efficient utilization of the GPU, while the computational capacity of the CPU is ignored. Open multi-processing (OpenMP)-based CPU parallelism and compute unified device architecture (CUDA)-based GPU parallelism [37] have been incorporated into optimization algorithms to accelerate them. Lu et al. [38] first exploited the computational capacities of both CPUs and GPUs in the Tianhe-1A supercomputer to perform a long-wave radiation simulation, and discussed ways to distribute the workload between CPU and GPU to achieve high computational efficiency. Subsequently, Cao et al. [39] took into account the cost of communication between GPU and CPU and developed a formula method for workload allocation. However, few research papers focus on a parallel strategy using both CPU and GPU for ITO. The challenge in designing ITO heterogeneous parallel algorithms is to achieve workload balancing on the CPU/GPU to ensure computational efficiency. Meanwhile, the minimum mapping range of the GPU to host memory must be determined to improve the efficiency of memory usage and reduce the data transfer time from CPU to GPU.
There is little literature on ITO with heterogeneous parallel acceleration. In this paper, a hybrid parallel strategy for ITO with CPU/GPU heterogeneous computing is proposed to accelerate the main time-consuming computational processes. A hybrid parallel strategy for stiffness assembly based on control point pairs is achieved by CPU/GPU hybrid computing for the first time, contributing to efficiency improvements. A dynamic workload balancing method is presented for efficiency and versatility: tasks are assigned according to the real-time local computing power measured in a pre-run phase. The rest of the paper is structured as follows: NURBS-based IGA and CPU/GPU heterogeneous parallel computing are briefly reviewed in Section 2. Section 3 illustrates the hybrid parallel strategy for the ITO processes, including stiffness matrix assembly, equation solving, sensitivity analysis, and the update scheme. A dynamic workload balancing method is proposed in Section 4. The advantages and correctness of the hybrid parallel strategy are demonstrated with several benchmark cases in Section 5. Finally, Section 6 concludes the paper and presents an outlook on future research.

Basic Theory
The theoretical foundations, including IGA, SIMP-based ITO, and CPU/GPU heterogeneous computing [40,41], are summarized in this section.

NURBS Basic Theory
In IGA, NURBS is commonly used to discretize the design domain [42]. A knot vector Ξ, representing parametric coordinates, is a sequence of non-decreasing real numbers:

Ξ = {ξ_1, ξ_2, ..., ξ_{n+p+1}},  ξ_i ≤ ξ_{i+1}

where n is the number of control points, and p denotes the order of the B-spline. By the Cox-de Boor formula, the B-spline basis functions B_i^p(ξ) can be derived recursively from the given knot vector [43]:

B_i^0(ξ) = 1 if ξ_i ≤ ξ < ξ_{i+1}, and 0 otherwise
B_i^p(ξ) = (ξ − ξ_i)/(ξ_{i+p} − ξ_i) B_i^{p−1}(ξ) + (ξ_{i+p+1} − ξ)/(ξ_{i+p+1} − ξ_{i+1}) B_{i+1}^{p−1}(ξ)

The NURBS basis function N_i^p(ξ) can be obtained by introducing a positive weight w_i for each B-spline basis function [44]:

N_i^p(ξ) = w_i B_i^p(ξ) / Σ_{j=1}^{n} w_j B_j^p(ξ)

Based on the tensor-product property, three-dimensional NURBS basis functions N_{i,j,k}^{p,q,r}(ξ, η, ζ) are produced from the following formula [18]:

N_{i,j,k}^{p,q,r}(ξ, η, ζ) = w_{i,j,k} B_i^p(ξ) B_j^q(η) B_k^r(ζ) / Σ_î Σ_ĵ Σ_k̂ w_{î,ĵ,k̂} B_î^p(ξ) B_ĵ^q(η) B_k̂^r(ζ)

where w_{i,j,k} is the weight value of the tensor product.
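The Cox-de Boor recursion above can be written directly in C. The following is a minimal illustrative sketch (the function names `bspline_basis` and `nurbs_basis` are ours, not from the paper's implementation), using the half-open convention B_i^0(ξ) = 1 on [ξ_i, ξ_{i+1}):

```c
#include <assert.h>
#include <math.h>

/* B-spline basis function B_i^p(xi) by the Cox-de Boor recursion.
 * Xi is the knot vector of length n_knots; 0/0 terms are taken as 0. */
double bspline_basis(const double *Xi, int n_knots, int i, int p, double xi) {
    if (p == 0) {
        /* Zeroth order: 1 inside the knot span [Xi[i], Xi[i+1]), else 0 */
        return (Xi[i] <= xi && xi < Xi[i + 1]) ? 1.0 : 0.0;
    }
    double left = 0.0, right = 0.0;
    double d1 = Xi[i + p] - Xi[i];
    double d2 = Xi[i + p + 1] - Xi[i + 1];
    if (d1 > 0.0)
        left = (xi - Xi[i]) / d1 * bspline_basis(Xi, n_knots, i, p - 1, xi);
    if (d2 > 0.0)
        right = (Xi[i + p + 1] - xi) / d2 * bspline_basis(Xi, n_knots, i + 1, p - 1, xi);
    return left + right;
}

/* NURBS basis: weighted B-spline basis normalized by the weighted sum
 * over all n_cp control points. */
double nurbs_basis(const double *Xi, int n_knots, const double *w, int n_cp,
                   int i, int p, double xi) {
    double denom = 0.0;
    for (int j = 0; j < n_cp; ++j)
        denom += w[j] * bspline_basis(Xi, n_knots, j, p, xi);
    return w[i] * bspline_basis(Xi, n_knots, i, p, xi) / denom;
}
```

For the open knot vector {0,0,0,1,1,1} with p = 2 (a quadratic Bézier segment), the three basis functions at ξ = 0.5 evaluate to 0.25, 0.5, 0.25, and with unit weights the NURBS basis coincides with the B-spline basis.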

SIMP-Based ITO
The SIMP material model is implemented to search for the optimized solution in ITO. The design variable is the density x, which keeps the distribution of material under control [45]. ITO-SIMP aims to maximize the structural stiffness, which can be converted to minimizing compliance. In ITO-SIMP, the density variables are stored at the control points, and the element density x_e can be expressed through the control point densities as [46]:

x_e = Σ_{i∈m} N_i x_i

where the density of element e is evaluated at the element center (ec). m is the set of control points related to element e, N_i denotes the NURBS basis function of the ith control point, and the corresponding density is written as x_i.
Based on the SIMP material model, Young's modulus E_e(x_e) of the element can be represented as [47]:

E_e(x_e) = (x_e)^t E_0

where E_0 is Young's modulus of the base material, and the penalty coefficient t is greater than 1, which penalizes intermediate densities in the material's stiffness.
SIMP-based topology optimization finds the distribution of material that minimizes compliance, which can be stated mathematically as [48]:

min: C(x) = U^T K U = Σ_e (x_e)^t u_e^T k_e u_e
s.t.: K U = F
      V(x)/V_0 ≤ θ
      0 < x_min ≤ x_e ≤ 1

where C is the compliance, K represents the global stiffness matrix, F denotes the load vector, and U is the global displacement field. k_e denotes the element stiffness matrix calculated with unit Young's modulus, and u_e is the element displacement vector. θ is the volume fraction, while V_0 and V(x) denote the volumes of the design domain and the material, respectively. x_e is bounded below by a small positive value to avoid singularity of the stiffness matrix.

CPU/GPU Heterogeneous Computing

GPU Parallel Architecture
GPUs are computer graphics processors that can process extensive data in parallel [49]. Since NVIDIA released CUDA in 2007, many researchers have used GPUs to tackle large-scale scientific computing problems [50]. The CUDA programming model provides a heterogeneous computing platform consisting of CPU and GPU architectures. Applications are divided into CPU host-side and GPU device-side code, and information is exchanged via the peripheral component interconnect express (PCIe) bus. Host-side code is responsible for controlling the device and data transfer, while device-side code defines the operational functions performed by kernel functions. The thread is the smallest execution unit, and the GPU uses many threads to execute kernel functions during parallel computing. Logically, threads are grouped into blocks of a certain size. The threads in a block run in warps (sets of 32 threads) on the CUDA cores, as shown in Fig. 1. The warp is the execution unit of the streaming multiprocessor (SM); an SM supports concurrent execution of a large number of threads, which are managed in a single-instruction-multiple-threads (SIMT) fashion.
Multi-core CPUs compute in parallel with fewer cores but more arithmetic power per core than GPUs [51]. The CPU/GPU heterogeneous parallel programming model is based on a heterogeneous computing platform where the computing power of both GPUs and CPUs is exploited [52]. OpenMP supports multi-threaded concurrent execution of tasks on multi-core CPUs [53]. The independence of CPU cores allows different tasks to be performed simultaneously in different OpenMP threads. Typically, the CPU is only involved in controlling the GPU (e.g., transferring data and launching kernel functions) but not in computing tasks. Instead, OpenMP is used here in CPU/GPU heterogeneous parallel programming to enable multi-threading on the CPU, where one OpenMP thread is responsible for interaction with the GPU and the others for computation [54]. Hence, the CPU and GPU work concurrently and cooperatively on a given workload. As shown in Fig. 2, the total workload is divided into CPU and GPU parts. The CPU runs in "one-thread-multi-node" mode, where each thread iterates through multiple tasks in a loop. The GPU operates in "one-thread-one-node" mode, where each thread performs only one task. CPU/GPU heterogeneous parallel computing is expected to accelerate the ITO computational processes. The proposed CPU/GPU hybrid parallel strategy for ITO covers stiffness matrix assembly, equation solving, sensitivity analysis, and design variable update.

Strategy for Stiffness Matrix Assembly
The global stiffness matrix assembly consumes substantial computational resources. A parallel strategy is to calculate the local stiffness matrices among threads, where the contributions of the Gaussian points in each element are summed up [55]:

k_e = Σ_G w_G B_G^T D B_G

where B_G is the deformation matrix calculated at the Gaussian points and w_G is the weight factor. Each local stiffness matrix is then added into the global stiffness matrix K at the locations determined by the element's degrees of freedom:

K = Σ_e k_e (assembled over all elements e)
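A serial reference for the per-element quadrature loop, before any parallel split, might look like the sketch below (row-major layouts and the function name are our assumptions, not the paper's code):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

/* Accumulate k_e = sum_G w_G * B_G^T * D * B_G.
 *   ke : ndof x ndof element stiffness (output, row-major)
 *   B  : nG blocks of nstr x ndof deformation matrices (row-major)
 *   D  : nstr x nstr elasticity matrix
 *   wG : quadrature weight per Gaussian point (incl. Jacobian factor) */
void element_stiffness(double *ke, int ndof, int nstr, int nG,
                       const double *B, const double *D, const double *wG) {
    memset(ke, 0, sizeof(double) * (size_t)ndof * ndof);
    for (int g = 0; g < nG; ++g) {
        const double *Bg = B + (size_t)g * nstr * ndof;
        for (int a = 0; a < ndof; ++a)
            for (int b = 0; b < ndof; ++b) {
                double s = 0.0;
                for (int i = 0; i < nstr; ++i)
                    for (int j = 0; j < nstr; ++j)
                        s += Bg[i * ndof + a] * D[i * nstr + j] * Bg[j * ndof + b];
                ke[a * ndof + b] += wG[g] * s;
            }
    }
}
```

For a one-point, one-strain-component toy case with B = [-1, 1], D = [2], and w = 0.5, this produces the familiar symmetric 2x2 stiffness [[1, -1], [-1, 1]].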

Thread Race Condition in Heterogeneous Parallelism
Theoretically, assembling the global stiffness matrix element by element can be directly parallelized [56]. However, because control points are shared among elements, a memory address may be written by multiple threads when the element-wise heterogeneous parallel strategy shown in Fig. 3 is employed. Such a conflict, called a thread race condition, leads to incorrect updates of the stiffness coefficients.
Although atomic operations can avoid race conditions, the efficiency of heterogeneous parallelism would be significantly reduced [57], and the assembly process would effectively degrade to serial execution. To fundamentally avoid race conditions and maintain the efficiency of parallel computation, a hybrid parallel strategy for stiffness matrix assembly based on control point pairs is proposed herein. The workload is appropriately assigned between the host CPU and device GPU, while the heterogeneous parallel threads are divided by interacting i-j control point pairs, as shown in Fig. 4. Considering the control point pairs shared by elements, as shown in Fig. 5, the local stiffness matrix k_e of each element is decomposed into a series of submatrices H_ij defined on the control point pairs [35]:

H_ij = B_i^T D B_j

where B_i and B_j are the deformation matrices corresponding to the i-j control point pair, and D is the elasticity matrix. The submatrices H_ij at all shared Gaussian points are calculated, multiplied by the weight factors, and summed to generate the final coefficients K_ij of the global matrix K:

K_ij = Σ_G w_G H_ij

The proposed hybrid parallel strategy for stiffness matrix assembly is thus based on interacting control point pairs. Synchronization between threads on GPU and CPU can be avoided, making the algorithm suitable for efficient hybrid parallel computing. There are two phases: (1) the derivatives of the shape functions are calculated for all affected Gaussian points; the computational workload is divided by element, each element contributing the shape function derivatives at its set of Gaussian points, and each heterogeneous parallel thread computes the derivatives for one or more elements, as shown in Fig. 6, which increases the flexibility for calculating the global stiffness coefficients. (2) The stiffness coefficients are then computed per control point pair, as described below.

In the second phase, the shape function derivatives are stored in GPU global memory and CPU shared memory. As shown in Fig. 7, threads can access random memory addresses, and different threads may concurrently read the same address. The computational workload is divided by control point pairs. Each thread completes the numerical integration over the shared Gaussian points of its pair, calculating the w_G H_ij submatrices as in Eq. (12). Finally, the parallel threads fill the stiffness coefficients into the corresponding unique positions of matrix K. The race condition is thus eliminated by the hybrid parallel strategy, a precondition for efficient parallel computing. In addition, the total computation task can be divided into multiple fine-grained subtasks between CPU and GPU, which contributes to efficiency improvements. The simplified heterogeneous parallel algorithm for stiffness matrix assembly is stated in Table 1. The "one-thread-one-stiffness-matrix" mode on the GPU and the "one-thread-multi-stiffness-matrix" mode on the CPU are adopted in the hybrid parallel strategy. The symbol ← indicates variable assignment operations in local memory, and the double-line arrows ⇒/⇐ indicate global memory read/write operations. Table 1 shows the first phase of the heterogeneous parallel strategy for the stiffness matrix assembly. The sensitivityFilter() function is a filtering scheme for smoothing free design boundaries in narrow-band regions; by using a window function to filter the pseudo-densities of the elements, the smoothness of the strain energy density is improved. The spaceConverter() function calculates the coordinates of the control points in parameter space. The JacobianMapping() function transforms the Jacobian matrix. The Nurbs3Dders() function calculates the partial derivatives of the shape functions in parameter space and multiplies them by the inverse Jacobian matrix. The results are stored in matrix d_dRdx as input for the second phase of the calculation, which is shown in Table 2. The DOFs indicate the locations of the stiffness coefficients in matrix K. Each thread iterates through the elements shared by its control point pair. The shape function derivatives of the node pair are obtained according to the local indices of the control points in the element, and the stiffness coefficient K_ij is calculated by integrating over all shared Gaussian points. The sparse matrix K is compressed and stored in COO format to save memory, recording only non-zero entries. Arrays in a C/C++ structure store three vectors: the row and column index vectors (iK, jK) and the non-zero value vector (vK). Unlike assembling K by adding local stiffness contributions k_e, the hybrid parallel strategy generates the final stiffness coefficients directly. Therefore, there are no repeated combinations of row and column indices: each non-zero value in matrix K is specified by a unique row-column combination, as shown in Fig. 8.
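The race-free write pattern can be sketched as follows: each control-point pair owns a disjoint slot range in the COO arrays, so no two threads ever touch the same entry. The layout and names here are illustrative, not the paper's actual data structures:

```c
#include <assert.h>

/* COO storage: row indices, column indices, non-zero values. */
typedef struct { int *iK; int *jK; double *vK; } Coo;

/* Hypothetical layout: pair p of nodes (ni, nj) with d DOFs per node
 * owns slots [p*d*d, (p+1)*d*d) -- disjoint across pairs, hence a
 * thread handling one pair never races with another. */
void write_pair(Coo *K, int p, int ni, int nj, int d, const double *Hij) {
    int base = p * d * d;
    for (int a = 0; a < d; ++a)
        for (int b = 0; b < d; ++b) {
            int s = base + a * d + b;
            K->iK[s] = ni * d + a;      /* global row DOF */
            K->jK[s] = nj * d + b;      /* global column DOF */
            K->vK[s] = Hij[a * d + b];  /* final coefficient, written once */
        }
}
```

Because every (row, column) combination appears exactly once, no atomic addition is needed when filling vK, which is the property the strategy relies on.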

Strategy for Equation Solving
Fast solution of the equilibrium equations can significantly accelerate the optimization iterations [58]. A hybrid parallel strategy for the preconditioned conjugate-gradient (PCG) method is studied herein to improve equation-solving efficiency.

Preconditioned Conjugate-Gradient Method
The conjugate-gradient (CG) method is an iterative method for solving systems of linear algebraic equations; the preconditioned conjugate-gradient (PCG) method adopts a preconditioner to adjust the coefficient matrix and improve convergence [59]. A series of approximate solutions is obtained during the iterations, and the iteration ends once the error reaches the given tolerance. Applying PCG to solve the equation Kx = f in ITO, the algorithm can be described as:

r_0 = f − K x_0,  z_0 = M^{−1} r_0,  p_0 = z_0
α_k = (z_k^T r_k) / (p_k^T K p_k)
x_{k+1} = x_k + α_k p_k
r_{k+1} = r_k − α_k K p_k
z_{k+1} = M^{−1} r_{k+1}
β_k = (z_{k+1}^T r_{k+1}) / (z_k^T r_k)
p_{k+1} = z_{k+1} + β_k p_k

where M denotes the preconditioning matrix, and r_k is the residual between the approximate and exact solutions. In the PCG method, the matrix M should make the condition number of M^{−1}K close to 1, according to the convergence rate [60]:

||x_k − x*||_K ≤ 2 ((√c − 1)/(√c + 1))^k ||x_0 − x*||_K

where c is the condition number of the coefficient matrix. When c(M^{−1}K) is closer to 1 than c(K), the convergence is accelerated considerably.
An incomplete Cholesky factorization is utilized to obtain a well-performing preconditioning matrix M, which is factorized as:

M = L L^T

where L is a lower triangular matrix. The convergence is accelerated because c((L L^T)^{−1} K) is closer to 1 than c(K).
As seen from Table 3, the computations of the vector dot products z_{k+1}^T r_{k+1} and z_k^T r_k are mutually independent during the iteration. Overlapping these independent computations reduces the time for equation solving.
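A host-side sketch of overlapping the two independent dot products with OpenMP sections follows. In the paper these run in separate CUDA streams; here plain loops stand in for the kernels, and the pragmas degrade gracefully to serial execution if compiled without OpenMP support:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Compute the two independent dot products of the PCG iteration
 * concurrently; each OpenMP section maps to one worker thread
 * (in the paper, to one CUDA stream). */
void overlapped_dots(const double *z1, const double *r1,
                     const double *z0, const double *r0, size_t n,
                     double *dot1, double *dot0) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            double s = 0.0;
            for (size_t i = 0; i < n; ++i) s += z1[i] * r1[i];
            *dot1 = s;   /* z_{k+1}^T r_{k+1} */
        }
        #pragma omp section
        {
            double s = 0.0;
            for (size_t i = 0; i < n; ++i) s += z0[i] * r0[i];
            *dot0 = s;   /* z_k^T r_k */
        }
    }
}
```

The design choice mirrors the paper's scheme: because the two reductions touch disjoint outputs, no synchronization is needed between the sections until both have finished.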

Hybrid Parallel Strategy of PCG
The CUDA stream, a kind of logical queue, is utilized in the hybrid parallel strategy for PCG. Different streams can execute multiple commands concurrently on an NVIDIA GPU [61,62], while the sequence of operations within one stream is performed serially in order. Independent computations are executed in different CUDA streams, parallelizing the originally serial process. As shown in Fig. 9, the number of CPU threads equals the number of CUDA streams. Each CUDA stream executes different parallel operations concurrently, and the OpenMP threads can update data before or after launching a stream. The CPU threads launch kernel functions concurrently and complete the related host-side calculations. Based on OpenMP, the total delay for launching kernel functions serially is reduced, and the data processing for different kernel functions is executed in the respective threads. This avoids having to synchronize streams in order to update data in the master thread. The simplified heterogeneous parallel algorithm for PCG is shown in Table 4. In each iteration, the cuSPARSE library function cusparseSpSV_solve() is applied to solve the sparse triangular linear systems d_zm1 ⇐ (L^T)^{−1} L^{−1} d_r1, i.e., z_{k+1} = M^{−1} r_{k+1}, which is key to an efficient PCG solution. The multiplication of a sparse matrix matA and a dense vector d_p is performed by the cuSPARSE library function cusparseSpMV(); the sparse matrix is compressed and stored in CSR format. The kernel functions myDcopyKernel() and myDdotKernel() are designed to perform copying and dot products of dense vectors. The kernel function myDscalKernel() calculates a vector-scalar multiplication. The function myDaxpyKernel() computes d_x ⇐ alpha * d_p + d_x, which multiplies the vector d_p by the scalar alpha and adds it to the vector d_x. The OpenMP compile directive #pragma omp parallel sections initially creates the threads (fork), and each #pragma omp section directive is followed by an independent phase executed concurrently in its own CPU worker thread.

Strategy for Sensitivity Analysis
According to the SIMP model, the material properties of an element are represented by Young's modulus, and the compliance C can be formulated as a summation of the element strain energies multiplied by the Young's moduli [63]. Therefore, the element strain energy with unit Young's modulus, S_e, is calculated as:

S_e = u_e^T k_e u_e

then the compliance C can be described as:

C = Σ_e (x_e)^t E_0 S_e

Therefore, the compliance sensitivity ∂C/∂x_e can be written as:

∂C/∂x_e = −t (x_e)^{t−1} S_e = −t (x_e)^{t−1} u_e^T k_e u_e    (17)

In the process of sensitivity analysis, the calculation of the strain energies, the main time-consuming part, is parallelized [64]. The heterogeneous parallel strategy for sensitivity analysis is illustrated in Table 5. The task set is divided by element: the strain energy of one element is calculated in one task. In the hybrid parallel strategy, the "one-thread-one-strain-energy" mode on the GPU and the "one-thread-multi-strain-energy" mode on the CPU are adopted.

Hybrid Parallel Strategy for Update Scheme
For discrete optimization problems with many design variables, iterative optimization techniques such as the method of moving asymptotes and the optimality criterion (OC) method are usually adopted [65]. The OC method is chosen herein due to its efficiency for problems with few constraints. A heuristic scheme in the OC iteration updates the design variables. Following the optimality condition, B_e can be written as:

B_e = (−∂C/∂x_e) / (λ ∂V/∂x_e)

where V is the material volume and λ is the Lagrange multiplier of the volume constraint. Finally, the update scheme can be illustrated as:

x_e^{new} = max(x_min, x_e − m)   if x_e B_e^η ≤ max(x_min, x_e − m)
x_e^{new} = min(1, x_e + m)       if x_e B_e^η ≥ min(1, x_e + m)
x_e^{new} = x_e B_e^η             otherwise

where m is the move limit and η is the damping factor, set to 0.3.
Here, the design variable x is updated in heterogeneous parallel fashion during each OC iteration. The workload is divided by element. Table 6 shows the procedure of the update method; the strategy uses the "one-thread-one-design-variable" mode on the GPU and the "one-thread-multi-design-variable" mode on the CPU.

Strategy for CPU-GPU Data Transfer
The hybrid parallel strategy requires a large amount of data transfer between the CPU and GPU, which is time-consuming. Therefore, efficient data transfer is crucial for CPU/GPU hybrid computing, and an efficient data transfer method is adopted to obtain high performance in CPU/GPU heterogeneous parallel computing.

Data Flow between CPU and GPU
In a CPU/GPU heterogeneous computing system, the architecture and memory system of the CPU differ from those of the GPU, so the GPU cannot directly access CPU memory for computation. When performing heterogeneous parallel computation, the computational data must be transferred from the CPU to the GPU. Depending on the specific hardware and software, the data flow between the CPU host side and the GPU device side proceeds as shown in Fig. 10. In the hybrid parallel strategy, the data is first written to system memory by the CPU. Then, the CPU sends the GPU a direct memory access (DMA) request to start the data transfer. With DMA, the transfer is merely initiated by the CPU; a dedicated DMA controller on the system bus then performs the transfer between CPU and GPU. Thus, the CPU's involvement in the data transfer itself is avoided, freeing it to perform other tasks.

CPU-GPU Data Transfer Method for Hybrid Parallel Strategy
In the hybrid parallel strategy, the CPU memory is allocated as page-locked memory to ensure highly efficient data transfer between the CPU and GPU. Page-locked memory offers several advantages: the bandwidth between CPU and GPU memory is higher, so the transfer is faster, and it allows the GPU to perform data transfers directly through the DMA engine without CPU involvement, reducing overall latency and transfer time. In addition, CUDA permits certain asynchronous concurrent execution based on page-locked memory. Many researchers have achieved speed-ups by overlapping data transfer and kernel execution in CUDA [65]. Directly applying this overlap to the ITO hybrid parallel strategy is challenging, as the data set is hard to divide into chunks of a suitable size for each kernel execution; it will be integrated in future work.
Several functions are provided by the CUDA runtime for page-locked memory. One is cudaHostAlloc(), which allocates new page-locked host memory; the other, cudaHostRegister(), pins already-allocated pageable memory so that it becomes page-locked. The latter is adopted in the data transfer method. Then, cudaMemcpyAsync() is applied to transfer data asynchronously from the CPU to the GPU. The data transfer is completed by the GPU and signaled to the CPU, which allows the CPU to overlap data transfers with other computations, improving performance and reducing overall execution time.
In the hybrid parallel strategies proposed in this paper, the whole workload is split into two parts and the tasks are allocated to the CPU and GPU. The GPU's task set corresponds to only a portion of the resource data on the host, which provides an opportunity to reduce data transfer time by minimizing communication between the CPU and GPU. To minimize the communication, the index range of the vectors needed by the GPU is determined first. For example, in the sensitivity analysis, the workload is divided by elements, so the corresponding ranges of vectors such as element indices can easily be determined. When transferring data from the CPU to the GPU, only the related data are transferred, which saves communication time.
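The range computation can be sketched as follows. Here the GPU is assumed to take a contiguous leading slice of the element-indexed arrays, so only `count` entries (rather than the full vector) need to cross the PCIe bus; the contiguous-slice layout and the names are our assumptions:

```c
#include <assert.h>
#include <stddef.h>

/* Given ntask element tasks and the GPU's workload fraction alpha,
 * compute the slice [first, first + count) of per-element host arrays
 * that actually needs to be transferred to the device. */
void gpu_task_range(size_t ntask, double alpha, size_t *first, size_t *count) {
    *first = 0;                                      /* GPU slice leads the array */
    *count = (size_t)(alpha * (double)ntask + 0.5);  /* round to nearest task */
    if (*count > ntask) *count = ntask;
}
```

A cudaMemcpyAsync() call would then copy only `count * sizeof(element_data)` bytes instead of the whole array, which is the communication saving described above.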

Load Balancing Strategy for CPU/GPU Heterogeneous Computing
In heterogeneous parallel computing, the load balancing strategy is key to ensuring computational efficiency. Thus, a dynamic workload balancing method is proposed in this section.

CPU/GPU Computing for ITO
The computing resources in the heterogeneous cluster include one multi-core CPU and one many-core GPU. In some GPU parallel studies, the CPU is responsible for data preparation and transfer, while the GPU performs the arithmetic operations [66,67]. However, some CPU cores then sit idle while preparing and transferring data for the GPU, wasting computational resources [68]. Therefore, cooperative computation on a shared workload is investigated herein.
As described in Section 3.1, the workload for the first phase of stiffness matrix assembly can be subdivided into N_x * N_y * N_z independent tasks (N_x, N_y, N_z denote the mesh size in the X, Y, Z directions). The workload for the second phase is subdivided into N_P independent tasks, where N_P is the number of control point pairs. Therefore, the workload can be flexibly distributed between CPU and GPU, as shown in Fig. 11. The workload represents the total number of tasks and is divided into two parts: one CPU core is reserved for data interaction with the GPU, and the remaining (n−1) CPU cores handle the workload fraction (1−α), where α denotes the fraction of the workload assigned to the GPU.

Dynamic Workload Balancing
For heterogeneous parallelism, balancing the workload between the CPU and GPU, which have different arithmetic capabilities, is critical for efficient computing [69,70]. There are three main methods to evaluate the best workload balancing ratio α: the enumeration method, the formula method, and the pre-run method [71,72]. In the enumeration method, all possible workload balancing strategies are executed, and the ratio α with the shortest time is chosen. The formula method requires quantifying the computing power of the hardware devices. Let δ_CPU and δ_GPU denote the computing power of one CPU core and of all GPU cores, respectively; the computing power of the whole CPU is then (n−1)δ_CPU. The wall-clock time τ for CPU/GPU computing can be expressed as:

τ = max(τ_CPU, τ_GPU)

where τ_CPU is the wall-clock time for CPU computing and τ_GPU that for the GPU. The total computing time τ is determined by the greater of τ_CPU and τ_GPU. Therefore, when τ_CPU equals τ_GPU, the total computing time is minimized and mutual waiting between CPU and GPU is avoided; the best workload balancing ratio α and the CPU/GPU workloads follow from equating the two times.

The formula method requires an accurate quantification of hardware computing power. Although nominal values can be obtained directly from the APIs, the actual computational efficiency is affected by the parallel algorithm and the hardware at run time. Therefore, a dynamic workload balancing method combining the formula and pre-run methods is proposed in this paper, in which the pre-run method is used to correct the theoretical parameters of the formula method. Assuming there are N = N_x × N_y × N_z independent tasks, τ_CPU can be written as:

τ_CPU = (1 − α) N t_CPU / (n − 1)

where t_CPU is the time for one CPU core to execute one task. Taking into account the time consumed by CPU-GPU data transfer, τ_GPU can be written as:

τ_GPU = τ_DT + τ_G

where τ_DT is the data transfer time and τ_G is the GPU computation time. When the workload balancing ratio α is given, τ_DT and τ_G can be evaluated as:

τ_DT = α N k S_val / v = α N t_dt,   τ_G = α N t_GPU

where k denotes the space complexity factor, S_val is the number of bytes per data unit, and v is the bandwidth of the PCIe bus connecting the CPU and GPU. t_dt denotes the average data transfer time for one task, and t_GPU the average time for the GPU to execute one task. The total computing time τ is minimized when τ_CPU = τ_GPU:

(1 − α) N t_CPU / (n − 1) = α N (t_dt + t_GPU)

so the workload balancing ratio α can be expressed as:

α = t_CPU / (t_CPU + (n − 1)(t_dt + t_GPU))

In the dynamic method, the pre-run phase measures the actual data transfer time t_dt and the computation times t_GPU and t_CPU, as shown in Fig. 12. The workload W_pre of the pre-run phase is greater than (n−1), ensuring that each CPU core is loaded. After the pre-run phase, the execution times τ_1 and τ_2 for the CPU and GPU are recorded; because the formula parameters are refreshed by the pre-run, the load balancing adapts in real time. The per-task times can then be evaluated as:

t_CPU = (n − 1) τ_1 / W_CPU,   t_dt + t_GPU = τ_2 / W_GPU

where W_CPU and W_GPU are the numbers of pre-run tasks executed by the CPU and GPU. Finally, the workload balancing ratio α is obtained from τ_1 and τ_2 by substituting these values into the formula above. The time consumed per task for data transfer and computation is assumed not to change as the workload increases. The dynamic workload balancing algorithm is illustrated in Table 7. The total computational tasks in the ITO hybrid parallel strategies can be divided by control point pairs or by elements; since the tasks are independent, they are well suited to a workload balancing method based on task set division. Through the balancing, tasks are assigned according to the real-time local computing power measured in the pre-run phase. Therefore, the proposed dynamic workload balancing algorithm is reliable and versatile.
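Under an equal-split pre-run assumption (both sides probe the same number of tasks), the measured times τ_1 and τ_2 reduce the ratio to a one-liner; the sketch below (our names, not the paper's Table 7 algorithm) also converts the ratio into task counts:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Fraction of the total workload assigned to the GPU, given pre-run
 * wall-clock times tau1 (CPU side) and tau2 (GPU side incl. transfer)
 * over equal probe workloads: the faster side gets the larger share. */
double balance_ratio(double tau1, double tau2) {
    return tau1 / (tau1 + tau2);
}

/* Split ntask tasks: GPU takes alpha*ntask (rounded), CPU the rest. */
void split_workload(size_t ntask, double alpha,
                    size_t *gpu_tasks, size_t *cpu_tasks) {
    *gpu_tasks = (size_t)(alpha * (double)ntask + 0.5);
    if (*gpu_tasks > ntask) *gpu_tasks = ntask;
    *cpu_tasks = ntask - *gpu_tasks;
}
```

For example, if the CPU side took three times as long as the GPU side in the pre-run, the GPU receives 75% of the tasks, which equalizes the two finish times and avoids mutual waiting.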

Numerical Experiments
Three benchmarks are examined to verify the performance of the heterogeneous parallel ITO algorithm. Poisson's ratio v = −εl/ε is set to 0.3, where εl is the strain in the vertical direction and ε is the strain in the load direction. The modulus of elasticity E0 is 1.0 for solid material and 0.0001 for weak material, and the convergence criterion r = (C_{i−1} − C_i)/C_i is set to 0.01, where C_i is the compliance in the i-th OC iteration. When displaying the topology, the element density x_e has a threshold value of 0.5: elements with density below 0.5 are not displayed. The filter radius fr is empirically set to 0.04 times the maximum axial length of the mesh. All examples are run on a desktop with an Intel Xeon Gold 5218 2.3 GHz CPU (16 cores) and 128 GB of DDR4 SDRAM. The GPU is an NVIDIA GeForce RTX 3090 with 82 streaming multiprocessors and 10496 CUDA cores. The operating system is 64-bit Windows 10. The CPU code is compiled with MathWorks MATLAB 2019 or Visual Studio 2019, while the GPU code is compiled with NVIDIA CUDA 11.6. The heterogeneous parallel algorithms are implemented in the C programming language using CUDA and OpenMP, so the developed modules can also be used in software written in C++. Fig. 13 shows the interface of the efficient parallel software, in which parallel computing is used to solve the ITO problems:
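The stopping test used in the experiments can be written as a one-line check. This is a hedged sketch of the criterion r = (C_{i−1} − C_i)/C_i < 0.01; the function and parameter names are illustrative, not taken from the authors' code.

```c
#include <assert.h>

/* Convergence test from the experimental setup: relative compliance
 * drop r = (c_prev - c_curr) / c_curr between consecutive OC
 * iterations, declared converged when r < tol (tol = 0.01 here).
 * Sketch only; names are illustrative assumptions. */
static int has_converged(double c_prev, double c_curr, double tol)
{
    double r = (c_prev - c_curr) / c_curr;
    return r < tol;
}
```

With tol = 0.01, a compliance drop from 100.5 to 100.0 (r = 0.005) signals convergence, while a drop from 103.0 to 100.0 (r = 0.03) does not.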

Cantilever Beam
The cantilever beam is examined in this section to demonstrate the acceleration efficiency of the hybrid parallel strategy for ITO; the hybrid strategy is validated when its acceleration efficiency exceeds that of the GPU alone. Fig. 14 shows the design domain of the 3D cantilever beam. The beam length, width, and height are set to 3L, 0.2L, and L, respectively, where the height L is set to 1 following dimensionless calculation rules. A unit distributed vertical load F is applied downwards to the lower edge of the right end face, while the left face is fixed. Fig. 15 shows the three environments used to compare efficiency: CPU with MATLAB, GPU with CUDA, and the hybrid CPU/GPU with both C and CUDA. The original implementation of ITO is based on MATLAB; C and CUDA are used for parallel acceleration because they provide low-level access to the computer hardware. To illustrate the speed-up of the CPU/GPU heterogeneous parallel strategy, several instances of the cantilever beam problem with different levels of quadratic NURBS elements are examined. The computational time of the ITO processes is shown in Table 8. The stiffness matrix assembly and the sensitivity analysis are executed in every iteration of the solving process, and there the parallel algorithm is clearly more efficient than MATLAB. For the coarse mesh, the advantage of the hybrid over CUDA is not apparent. However, when the DOFs reach the millions, each step of the heterogeneous calculation is tens of seconds faster than CUDA and thousands of seconds faster than MATLAB. The speed-up ratio is obtained by comparing the hybrid computational time to the others. As listed in Table 9, taking S1 as an example, the speed-ups of the hybrid over MATLAB vary from 12.20 to 34.06, while those over CUDA range from 1.06 to 1.30. Since the CPU's parallel computing capability is much weaker than the GPU's, it is difficult for the hybrid CPU/GPU to achieve a large acceleration ratio over the single GPU; still, the speed-up ratio reaches 2.96 in S4. Note that under the current hardware conditions, MATLAB cannot solve the equations x = K\f at the mesh size of 270 × 90 × 18. The time consumption of the GPU comprises data transfer time and computation time. When the scale reaches a certain level and the computation time exceeds the data transfer time, the growing computation makes full use of the GPU's parallel computing power, which yields better acceleration. The GPU acceleration effect peaks as the scale increases, which is why the speed-up ratio in the table does not increase monotonically; overall, however, the speed-up ratio grows with the scale. The remarkable speed-up ratios prove the efficiency of the hybrid parallel algorithm, especially compared to MATLAB (up to 435.76 times). The ITO process based on the hybrid parallel strategy with the dynamic load balancing method is further accelerated by exploiting the CPU and GPU parallel computing power together. The time consumption and speed-up ratios for the stiffness matrix assembly, equation solving, sensitivity analysis, and the update scheme are shown in Fig. 16. The advantage of the hybrid strategy is not apparent at small scale, but its efficiency increases with the scale. The optimized results of the cantilever beam problem with different mesh scales are shown in Fig. 17, and all cases yield consistent optimized results. The color mapping reflects the element density, increasing from blue to red. When the number of elements is small, the boundary of the structure appears jagged and the continuity between element densities is low. As the number of elements increases, the boundary of the structure gradually becomes smooth and no large color gaps appear, indicating high numerical continuity between adjacent element densities, consistent with the characteristics of realistic material manufacturing. The time consumption of each process in the ITO iterations is shown in Fig. 18. In the CPU implementation with MATLAB, stiffness matrix assembly and sensitivity analysis are far more time-consuming than equation solving. In the hybrid, however, as the scale increases, the share of stiffness matrix assembly decreases and falls below that of equation solving. Compared to the GPU, the hybrid mainly reduces the stiffness matrix assembly time and achieves more significant gains at larger scales. Thus, the efficiency of the hybrid parallel strategy for ITO is demonstrated, and equation solving becomes the main time-consuming stage in ITO.

MBB Beam
The MBB beam problem demonstrates the robust adaptability of the hybrid parallel strategy. Compared to FEM-based TO, IGA-based TO performs the optimization analysis with higher-order NURBS elements, resulting in a significant increase in computational complexity and memory usage [24]. Considering the time cost, the maximum DOFs of the cases are set to two million, which exceeds the handling capacity of the GPU. The design domain of the MBB beam is shown in Fig. 19, where the length is 6L and the width and height are both L. A unit load F is applied downwards to the center of the upper end face. The four corners of the lower end face of the MBB beam are constrained, while one side is free in the horizontal direction. The memory usage of the ITO processes is listed in Table 10. When the scale reaches a critical level, the CUDA parallel method fails because the memory consumption exceeds the GPU's limit. The NVIDIA GeForce RTX 3090 used in this paper has 24 GB of memory, far less than the 128 GB of CPU memory, so limited memory is a performance bottleneck when using the GPU to accelerate large-scale problems. In this paper, the tasks are appropriately assigned to the CPU/GPU via the dynamic workload balancing strategy. Efficient management of GPU memory is achieved by determining the minimum dataset required for the GPU's tasks, which reduces the demand on GPU memory. The optimized results of the 3D MBB beam problem are shown in Fig. 20. The 3D case with the 270 × 45 × 45 mesh can only be solved by the hybrid method; the memory allocation between CPU and GPU in each ITO process is shown in Table 11. The required memory in stages S1, S2, and S4 is lower than the GPU memory, but S3 costs 26.9 GB, which exceeds the GPU's limit. In contrast, the hybrid method allocates memory properly between the CPU and GPU and maximizes the utilization of local computing resources.
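The placement decision described above can be sketched as a simple capacity check: tasks whose minimal dataset would not fit in device memory are spilled to the host. This is an illustrative assumption about the mechanism, with hypothetical names and a uniform per-task footprint; the paper's strategy determines the minimum dataset per GPU task.

```c
#include <assert.h>

/* Cap the GPU's share of n_tasks so its minimal dataset fits in
 * device memory (sketch; per-task footprint assumed uniform).
 * alpha:          CPU share from the load balancer,
 * bytes_per_task: estimated device bytes for one task's dataset,
 * gpu_free_bytes: available device memory.
 * Returns the GPU task count; the remainder runs on the CPU. */
static long gpu_task_count(long n_tasks, double alpha,
                           double bytes_per_task, double gpu_free_bytes)
{
    long want = n_tasks - (long)(alpha * (double)n_tasks + 0.5);
    long fit  = (long)(gpu_free_bytes / bytes_per_task);
    return want < fit ? want : fit;   /* spill the rest to the CPU */
}
```

With this rule the GPU never receives more work than its 24 GB can hold, while the balancer's ratio α is respected whenever memory is not the binding constraint.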

Wheel Beam
To demonstrate the accuracy of the proposed method, a 3D wheel beam problem is examined. The design domain is shown in Fig. 21. A unit external load F is applied to the center of the upper end face, and the four corners of the wheel beam's lower end face are constrained. The objective function values over the ITO iterations are recorded in Table 12, and Fig. 22 shows the convergence histories for the CPU and the hybrid. The objective function value, i.e., the compliance, decreases sharply at the beginning and then converges smoothly over the iterations. The ITO process stops at the 132nd iteration for the CPU and at the 132nd iteration for the hybrid. The iteration numbers are similar, and the results, illustrated in Fig. 23, show an identical structural topology. Table 13 records the relative errors between the CPU and hybrid computing, while the history of the relative error is shown in Fig. 24. In iterations 1-40 there is a significant fluctuation, since double precision is used in the CPU method while both double and single precision are used in the hybrid strategy to reduce memory consumption. After the 40th iteration, the relative error gradually stabilizes and stays below 0.0002.

Conclusions
A hybrid parallel strategy for isogeometric topology optimization is proposed in this paper. Compared with a general GPU parallel strategy, the proposed method improves computational efficiency while enhancing the ability to handle large cases. In the hybrid method, tasks are assigned to the CPU and GPU via the workload balancing strategy, so the local hardware resources are fully utilized, improving the ability to solve large ITO problems. Four parts of ITO (stiffness matrix assembly, equation solving, sensitivity analysis, and the update scheme) are accelerated by the hybrid parallel strategy, which shows significant speed-ups.
Three benchmark examples are tested to verify the proposed strategy. The 3D cantilever beam example demonstrates high computational efficiency through significant speed-up ratios over the CPU and GPU at different discretization levels. In the 3D MBB beam example, the GPU-only method cannot afford the required memory once the mesh reaches a certain scale, which shows the advantage of the hybrid parallel strategy in solving large ITO problems. Furthermore, the 3D wheel beam example demonstrates the accuracy of the hybrid parallel strategy.
Although the SIMP method is utilized in this paper, the proposed hybrid parallel strategy is general and equally applicable to other TO methods. In future work, distributed CPU/GPU heterogeneous parallel computing with multiple computing nodes will be investigated based on the current work.

Figure 2: Schematic diagram for OpenMP/CUDA parallel programming model

Figure 5: Shared control point pair between elements

Figure 6: First phase in heterogeneous parallel computing of assembling stiffness matrix

Figure 7: Second phase in the heterogeneous parallel strategy of assembling stiffness matrix

Figure 8: Storage of sparse matrix in COO format

Figure 10: Data flow process between CPU host side to GPU

Figure 11: Workload balancing between the CPU and GPU by loading balance strategy

Figure 12: Dynamic workload balancing for CPU/GPU heterogeneous computing

Figure 13: Interface of an efficient parallel software

Figure 14: Design domain and boundary conditions of 3D cantilever beam

Figure 15: Different environments for ITO implementation: (a) assembly of stiffness matrix K; (b) equation solving; (c) sensitivity analysis; (d) the update scheme

Figure 16: Time consumption and speed-up ratio in ITO processes

Figure 17: ITO results of cantilever beam problem with different NURBS elements

Figure 18: Time consumption of IGA processes with different numbers of elements
Figure 19: Design domain and boundary conditions of 3D MBB beam

Figure 20: ITO results of MBB beam problem with different NURBS elements

Figure 21: Design domain and boundary conditions of 3D wheel beam

Figure 22: Convergent histories of the wheel beam

Figure 23: Optimization results of the wheel beam

Figure 24: History of relative error between CPU and hybrid computing

Table 1: Phase 1 heterogeneous parallel algorithm for IGA. Segment 1: calculate the derivatives of the shape functions. Input: indices of elements idx, degrees of freedom (DOFs) of the element ed, coordinates of control points P, range of elements elU, elV, elW, control point numbers cp, knot vectors u, v, w, weights W, coordinates of Gauss points Q, number of Gauss points Ngs

Table 2: Phase 2 heterogeneous parallel algorithm for IGA. Input: number of elements in pair M_el, weights of Gauss points Wei, number of Gauss points N_gs, derivatives of the shape functions d_dRdx

Table 3: Algorithm for the PCG method

Table 8: Time consumption for one iteration of the ITO process in the cantilever problem (unit: s)

Table 9: Speed-up for one iteration of the topology optimization in the cantilever problem

Table 10: Memory usage of ITO processes in the cantilever problem (unit: GB)

Table 11: Memory allocation between host and device in the hybrid method (unit: GB)

Table 12: Objective function values in ITO iterations of CPU and hybrid

Table 13: Relative error between CPU and hybrid computing in ITO iterations