3D Model Occlusion Culling Optimization Method Based on WebGPU Computing Pipeline

Abstract: Web browsers have become an important platform for 3D model visualization because of their convenience and portability. Large-scale 3D model visualization in Web scenes suffers from slow rendering and low FPS (frames per second); occlusion culling, an important rendering optimization, can remove most occluded objects and improve rendering efficiency. The traditional occlusion culling algorithm (TOCA) traverses all objects in the scene, which involves a large amount of repeated calculation and is time-consuming. To improve rendering efficiency, this paper proposes occlusion culling with three different optimization methods based on the WebGPU computing pipeline. Firstly, to address the large amount of repeated calculation in TOCA, these calculations are moved from the CPU to the GPU for parallel computing, thereby accelerating the calculation of the Potentially Visible Set (PVS).


Introduction
In recent years, computer and communication technologies have become one of the main supports for realizing the "Digital and Transparent Earth" [1]. Geospatial information science and technology comprises 3D geological modeling, visualization, and other related disciplines [2]. However, 3D geological models are difficult to visualize because of their large data volume and complex hierarchical relationships. In particular, rendering large-scale 3D models on the Internet still suffers from poor rendering performance, display delays, and slow transmission of large amounts of data. Therefore, how to render and visualize graphics on the Internet more efficiently has gradually become an active research topic in computer graphics.
In a rendered scene, objects of different sizes often occlude one another significantly, so only a small proportion of the scene can be seen from a specific viewpoint. Occlusion culling is a key technique for large-scene rendering: it culls the invisible part of the scene without a noticeable effect on the final image. In this way, only the visible parts of the scene need to be transmitted to the GPU for rendering, which greatly reduces the rendering load [3,4].
Research on occlusion culling falls into two approaches: one relies on off-line calculation, which substantially reduces the number of objects to be rendered in real time; the other performs the calculation in real time during rendering, discarding in each frame the parts that do not need to be rendered. However, the former leads to long, inefficient calculation when the amount of preprocessing is too large. The latter is a real-time dynamic calculation that occupies a large amount of memory and CPU resources, which reduces the machine's rendering capability. To address these problems of traditional occlusion culling, this paper starts from break-time (off-line) occlusion culling and proposes three acceleration algorithms that speed up 3D model rendering by reducing the time consumed by off-line preprocessing.
The rest of this paper is organized as follows. Section 2 briefly discusses related works. Section 3 describes the details of the proposed methods. Section 4 shows the experimental results. Finally, Section 5 concludes the paper.

Occlusion-Culling-Related Algorithm
As early as the 1950s, researchers abroad had started research on 3D geological modeling techniques. With the development of computer technologies, 3D virtual environments can satisfy a variety of demands, such as spatial information management, spatial analysis and prediction, and geological statistics and interpretation; they not only yield a fine 3D geological model but also better describe the corresponding geological characteristics [5]. Thanks to the rapid development of computer graphics in recent years, breakthroughs have been made in both 3D visualization and rendering technologies, which play an important role in utilizing 3D data and determine whether the data can be observed and used by people [6]. The concept of occlusion culling was proposed to improve rendering efficiency and reduce rendering load at the same time.
For example, Hierarchical Z-Buffering [7] was proposed in the 1990s; it partitions space with an octree and down-samples the Z-buffer. However, every triangle in the scene still has to be checked, which consumes a large amount of graphics processing. Reference [8] proposed a novel microarchitectural technique, the Omega-Test, in which Coherent Hierarchical Culling (CHC) was introduced into the Z-buffer algorithm; it predicts visibility by using both the Z-test results discarded by the GPU and frame-to-frame coherence. Reference [9] proposed the Coherent Hierarchical Culling algorithm, which exploits the spatiotemporal coherence between two adjacent frames: for objects that were invisible in the previous frame, an occlusion query is issued without any rendering. By contrast, for objects that were visible, their visibility is updated according to the query, and the objects added in the adjacent frame are rendered subsequently. Reference [10] improved on this basis: the objects observed in the previous stage are selected as the invisible ones, and occlusion culling is performed according to the depth of the given scene management tree.
Although the aforementioned methods partly solve the occlusion culling problem, some defects remain, such as CPU stalls and GPU starvation caused by occlusion queries and the large amount of computing resources required, especially when visualizing large-scale complicated scenes on the Internet [11]. 3D model visualization in a web browser client, however, offers a possible way to overcome these issues and is more convenient and advanced than the traditional desktop [12]. Specifically, for a network-based 3D visualization system at or above the city level, the amount of rendering data may reach the terabyte level, which places higher demands on web browser clients in terms of rendering and data transmission. In recent years, graphics computing has been moved from the CPU to the GPU owing to the GPU's outstanding computing performance. For example, GPU-based point-cloud rendering reduces the computational burden on the CPU and improves overall rendering efficiency, with performance one order of magnitude higher than the hardware pipeline [13][14][15]. The intensive computation and slow rendering caused by complicated 3D scenes have always been key factors limiting visualization. Reference [16] combined a visibility-judgement-based slice culling algorithm with the clipping algorithm, the PVS-based culling algorithm, and the ROAM strategy to create a new culling method with both improved ROAM and dynamic PVS. From the graphics perspective, it improves the calculation and retrieval efficiency of geoscience data. However, owing to the highly non-linear, complex, and nonstationary properties of 3D geological models, rendering them more efficiently remains an open problem. Accelerating the computation with the GPU is therefore a possible way to solve this issue.

Occlusion Culling
For a given 3D scene and viewing angle, occlusion culling determines the occlusion relationships and discards the invisible graphical objects, which reduces the complexity of the scene and improves its realism, thereby enabling low-load creation and network transmission, as shown in Fig. 1 [17].
Occlusion culling methods are commonly classified as follows [18]. Based on granularity, they are divided into object-precision methods, which work at the object level, and image-precision methods, which work at the pixel level.
According to the viewpoint, they are divided into point-based and cell-based methods, which are calculated point by point and region by region, respectively.
Based on the organization of the occluding objects, they are classified into cell-and-portal methods and generic-scene methods. In the former, the region is divided into many cells, and different viewing frustums are rendered according to the cell structures and the locations of the portals. In the latter, there is no restriction on the occluding objects.
According to when the calculation takes place, they are divided into break-time and real-time occlusion culling. In the former, pre-processing is conducted before rendering; in the latter, rendering and occlusion culling happen simultaneously.

Break-Time Occlusion Culling Algorithm
To improve the frame rate of real-time rendering, the Break-time Occlusion Culling Algorithm (BOCA) moves the redundant computation into a pre-processing step, so that the Potentially Visible Sets (PVS) are calculated in advance and the rendering load is relatively low [19]. The PVS is the candidate set of non-occluded objects in the 3D scene obtained by the break-time computation. With the PVS, occlusion culling can be finished quickly when the 3D scene is created [20]. As shown in Fig. 2, the white part contains the objects of the PVS for the current viewpoint, which are transmitted to the GPU for rendering. The objects in the gray part are occluded and are removed for the same viewpoint.
The main process of BOCA includes four parts: building the scene tree, iterative intersection testing for all rays, obtaining the PVS, and real-time rendering. The flow chart of BOCA is shown in Fig. 3.
Pre-processing aims to obtain the PVS. In a real environment, the visible objects are those from which light rays reach the viewpoint. To simulate this, many rays are launched from the coordinates of each viewpoint in place of the incident light. Then, the intersection test between the objects in the scene and the rays is conducted. After that, the first object intersected by each ray is selected and saved in the PVS. When all the rays of a viewpoint have been traversed, the corresponding PVS is established. Repeating the operation for every viewpoint yields the PVS of all viewpoints.
For BOCA, especially in the Web client, the problems of an unstable computing environment, long computing time, and large resource consumption are more significant. Specifically, once the BVH tree has been constructed, the intersection testing and the calculation of the distance from the viewpoint to the selected bounding box (Dis (Viewpoint → BoundingBox)) could be executed in parallel, but they are instead executed serially on the CPU, which lowers computing efficiency and increases the rendering load. The GPU, with its excellent parallel computing capability, is well suited to such computations. Performing both steps on the GPU can therefore reduce the computing time and improve the computing efficiency. Accordingly, this paper proposes three occlusion culling algorithms based on WebGPU: the Improved WebGPU-Computing-Pipeline-based Algorithm (IWCPA), the BVH-based Algorithm (BVHA), and the Strategy-Adjustment-based Algorithm (SAA). These algorithms are described in detail in the next chapter.
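The pre-processing loop described above (cast rays from a viewpoint, keep the first object hit by each ray) can be sketched in a few lines of Python. This is only an illustrative, serial sketch: the function name `compute_pvs` and the `hit_distance` callback are hypothetical and not taken from the paper, and the callback is assumed to return the distance along the ray to an object, or `None` when the ray misses it.

```python
from typing import Callable, Hashable, Iterable, Optional, Set


def compute_pvs(
    rays: Iterable[object],
    objects: Iterable[Hashable],
    hit_distance: Callable[[object, Hashable], Optional[float]],
) -> Set[Hashable]:
    """Return the objects that are the first hit of at least one ray."""
    candidates = list(objects)
    pvs: Set[Hashable] = set()
    for ray in rays:
        nearest, nearest_t = None, float("inf")
        # Serial test of this ray against every object (the costly part).
        for obj in candidates:
            t = hit_distance(ray, obj)
            if t is not None and t < nearest_t:
                nearest, nearest_t = obj, t
        # Only the closest object along the ray is visible; the rest are occluded.
        if nearest is not None:
            pvs.add(nearest)
    return pvs


# Toy 1D scene: (name, x_min, x_max, depth); a "ray" is just an x position.
scene = [("front", 0, 5, 1.0), ("back", 0, 10, 2.0), ("hidden", 1, 4, 3.0)]
toy_hit = lambda x, o: o[3] if o[1] <= x <= o[2] else None
visible = {o[0] for o in compute_pvs([2, 7, 20], scene, toy_hit)}
```

In the toy scene the object named "hidden" never enters the set because a closer object always covers it. The nested loop over rays and objects is the part that the proposed algorithms move to the GPU.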

WebGPU
A Graphics Processing Unit (GPU) was initially used as an electronic subsystem for graphics processing. Owing to its unique architecture, however, developers can also implement a variety of general algorithms on it, which is known as GPU computing. WebGPU is a new graphics API that brings the capabilities of the GPU to Web browsers [21]. Reference [22] used WebGPU to implement a peer-to-peer cluster and adopted matrix multiplication and Mandelbrot sets to evaluate its performance. The experimental results show that parallel computing problems can be scaled in this way, with a 75% improvement in efficiency. Therefore, WebGPU is used here to optimize BOCA so that the time consumed by rendering can be reduced.
Before the pipeline concept was introduced, the CPU and GPU exchanged information point to point and therefore communicated frequently. Limited by this frequent communication, the computing ability of the GPU could not be fully exploited. Fig. 4 shows the transmission speed among different memories. WebGPU introduces the concepts of the rendering pipeline and the computing pipeline, which reduce the extra overhead between the CPU and GPU. Meanwhile, resource binding is also introduced, which makes it possible to share data globally among different pipelines. Multi-dimensional data processing also makes resource binding more convenient. Therefore, WebGPU is used as the graphics API.

Overall Architecture
Based on the characteristics of WebGPU, the IWCPA is proposed first. Then, to address the large overhead caused by creating the computing pipeline in the IWCPA, the BVHA is adopted. In practice, the BVH traversal in the BVHA turns out to consume a large amount of time, so the SAA is finally proposed. By traversing the BVH on the GPU, the SAA greatly improves the efficiency of this process. The flow chart of the three methods is shown in Fig. 5.
For the IWCPA, all the computation is moved to the GPU, which leaves a relatively low computing load on the CPU. The object selection is performed without the BVH, and although the parallel computing capability of the GPU is strong, not all the objects in the scene can be processed effectively in a very short period. Therefore, the BVH tree is introduced in the BVHA, in which the Dis (Viewpoint → BoundingBox) calculation is moved to the GPU and the remaining computation is still conducted on the CPU. However, the computing load on the CPU is then much higher and the utilization of the GPU is relatively low. Therefore, this article introduces the SAA, which can be viewed as a combination of the two aforementioned methods that balances utilization between the CPU and GPU. Specifically, the BVH is built on the CPU, the intersection testing is performed on the GPU, and then Dis (Viewpoint → BoundingBox) is calculated, which improves the computing efficiency to a large extent.

BVH Scene Tree and Slab Algorithm
During the intersection testing for all the objects in the scene, each ray from the viewpoint has to be tested, which consumes a large amount of computing resources and time. Therefore, the space of a scene can be divided into many subspaces that are calculated separately. Such a space division pattern is called a scene management tree. Common scene management trees include the octree, the KD tree, and so on.
The BVH tree uses bounding boxes with simple geometric shapes to approximate complicated geometric objects, which reduces the complexity of the 3D geological models and the subsequent calculations [23,24]. It overcomes the low search efficiency of the octree, whose depth may become much greater than usual because 3D geological models are spatially unbalanced and clustered [25]. Although the KD tree has better search efficiency, it occupies much more memory because of the large volume of 3D geological models [26]. Therefore, the BVH tree is adopted to manage the objects in the scene.
Fig. 6 shows the geological model of the southwestern region of Guizhou Province and the corresponding BVH visualization. In the BVH tree constructed in this paper, only the leaf nodes contain real objects; non-leaf nodes do not. Meanwhile, the Axis-aligned Bounding Box (AABB) is adopted to construct the BVH tree; it can be regarded as the smallest bounding box of the objects it encloses. Owing to its simple geometric information, it reduces the amount of calculation while ensuring calculation precision, which reduces the probability of mistakenly culling models in visible regions. Fig. 7 shows the BVH construction in 2D space. Fig. 7a shows the bounding boxes, numbered 1, 2, 3, and 4. First, an AABB is constructed to enclose all the bounding boxes, as shown in Fig. 7b. Then, the box is split at the center coordinate of its longest axis into two nodes (bounding boxes), as shown in the red part of Fig. 7c. This step is repeated until the bounding box of every real object corresponds to a leaf node, as shown in Fig. 7d.
Among the various intersection testing methods, the slab method is used in this paper [27]. The key to the slab method is that the bounding box is regarded as the space enclosed by three pairs of parallel planes. If the segments of the ray lying between each pair of parallel planes share a common part, the ray intersects the bounding box. For example, as shown in Fig. 8, O is the viewpoint, from which a ray is launched. A, B, C, and D represent the intersection points of the ray with planes X 0 , Y 0 , X 1 , and Y 1 respectively. The red and green points represent the entry and exit points of the ray at the corresponding planes respectively, and t represents the distance between a point on the ray and viewpoint O.
The slab method decides that the ray intersects the bounding box if the maximum value of t at which the ray enters a pair of planes (t 0 in Fig. 8) is smaller than the minimum value of t at which the ray leaves a pair of planes (t 1 in Fig. 8).
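The slab test described above can be written compactly. The following Python sketch is illustrative only and is not the authors' WebGPU implementation; the name `slab_distance` is hypothetical. It assumes the direction vector is normalized (so that t is a distance), clamps the entry value to zero so that boxes behind the viewpoint are rejected, treats a grazing contact (t 0 equal to t 1) as a hit, and returns −1 on a miss, following the convention used later for the IWCPA.

```python
import math
from typing import Sequence


def slab_distance(
    origin: Sequence[float],
    direction: Sequence[float],
    box_min: Sequence[float],
    box_max: Sequence[float],
) -> float:
    """Return the entry distance t0 of the ray into the AABB, or -1.0 on a miss."""
    t_enter = 0.0        # largest entry value over all slabs (ray starts at the viewpoint)
    t_exit = math.inf    # smallest exit value over all slabs
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it misses unless it starts between them.
            if o < lo or o > hi:
                return -1.0
            continue
        t_near = (lo - o) / d
        t_far = (hi - o) / d
        if t_near > t_far:
            t_near, t_far = t_far, t_near
        t_enter = max(t_enter, t_near)
        t_exit = min(t_exit, t_far)
        if t_enter > t_exit:
            # The per-slab intervals no longer overlap, so the box is missed.
            return -1.0
    return t_enter


viewpoint = (0.0, 0.0, 0.0)
hit = slab_distance(viewpoint, (1.0, 0.0, 0.0), (2.0, -1.0, -1.0), (4.0, 1.0, 1.0))
miss = slab_distance(viewpoint, (1.0, 0.0, 0.0), (2.0, 2.0, 2.0), (4.0, 4.0, 4.0))
```

When the viewpoint lies inside the box, this sketch returns 0, which is one possible convention among several.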

Figure 8: Schematic diagram of the slab algorithm. Here, t 0 refers to the maximum distance between the viewpoint and the points where the ray enters the planes, while t 1 refers to the minimum distance between the viewpoint and the points where the ray leaves the planes
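The BVH construction described in this section (enclose all boxes in one AABB, split at the center of the longest axis, repeat until every object has its own leaf) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `BVHNode`, `union`, `build_bvh`, and `leaf_ids` are hypothetical, objects are assigned to a side by the center of their own box, and a median split is used as a fallback when the center split would leave one side empty.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec = Tuple[float, ...]
AABB = Tuple[Vec, Vec]  # (min corner, max corner)


@dataclass
class BVHNode:
    box: AABB
    left: Optional["BVHNode"] = None
    right: Optional["BVHNode"] = None
    obj: Optional[int] = None  # object index, set on leaf nodes only


def union(boxes: Sequence[AABB]) -> AABB:
    """Smallest AABB enclosing all the given boxes."""
    dims = range(len(boxes[0][0]))
    lo = tuple(min(b[0][k] for b in boxes) for k in dims)
    hi = tuple(max(b[1][k] for b in boxes) for k in dims)
    return lo, hi


def build_bvh(boxes: Sequence[AABB], ids: Optional[List[int]] = None) -> BVHNode:
    if ids is None:
        ids = list(range(len(boxes)))
    bounds = union([boxes[i] for i in ids])
    if len(ids) == 1:
        return BVHNode(bounds, obj=ids[0])
    lo, hi = bounds
    axis = max(range(len(lo)), key=lambda k: hi[k] - lo[k])  # longest axis

    def centre(i: int) -> float:
        return 0.5 * (boxes[i][0][axis] + boxes[i][1][axis])

    mid = 0.5 * (lo[axis] + hi[axis])
    left = [i for i in ids if centre(i) < mid]
    right = [i for i in ids if centre(i) >= mid]
    if not left or not right:
        # Assumed fallback: split by median so that the recursion always terminates.
        ordered = sorted(ids, key=centre)
        half = len(ordered) // 2
        left, right = ordered[:half], ordered[half:]
    return BVHNode(bounds, build_bvh(boxes, left), build_bvh(boxes, right))


def leaf_ids(node: BVHNode) -> List[int]:
    """Collect the object indices stored in the leaves, left to right."""
    if node.obj is not None:
        return [node.obj]
    return leaf_ids(node.left) + leaf_ids(node.right)


# Four 2D boxes, in the spirit of Fig. 7.
boxes_2d = [((0, 0), (1, 1)), ((2, 0), (3, 1)), ((6, 0), (7, 1)), ((8, 2), (9, 3))]
root = build_bvh(boxes_2d)
```

As in the paper, only leaf nodes carry an object; inner nodes carry only the enclosing AABB.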

Improved WebGPU-Computing-Pipeline-Based Algorithm
The key to IWCPA is that the intersection tests between the rays and the bounding boxes of the objects in the scene are computed in parallel, using a 2D resource binding so that the calculation is easy to follow. The schematic diagram of the parallel computing is shown in Fig. 9. In the 2D parallel computing of WebGPU, each computing unit can be viewed as a parallel process; the ordinate and abscissa correspond to the ray and the bounding box of each object, respectively. Each rectangle is a computing unit that performs one intersection test. In the experiment, 16 × 16 computing units are used, so 256 intersection tests are completed at one time, whereas the CPU would need 256 serial iterations to finish the same work; the computing performance is therefore significantly improved.
The main steps of IWCPA are as follows. First, the bounding boxes are created according to the rendering requirements and meshes. After the AABBs are created (as shown in Fig. 6), the ray and AABB information is transmitted to the GPU, and the slab algorithm is used to test each ray against all bounding boxes. For a given ray, if it intersects a bounding box, the distance between them is returned; otherwise, the invalid value −1 is returned. When the intersection testing is finished, the bounding boxes with return value −1 are discarded and the others are sorted by distance. The mesh closest to the viewpoint is added to the PVS. This operation is repeated until the PVS of the viewpoint is obtained; finally, the corresponding data are loaded and rendered by the rendering pipeline. The pseudocode of the IWCPA steps is shown in Table 1.

BVH-Based Algorithm
BVHA combines IWCPA and BOCA. The BVH is first built to manage the objects in the scene and is then traversed to perform intersection testing. On this basis, the computing pipeline is created, and all the nodes obtained by intersection testing are passed into the pipeline so that Dis (Viewpoint → BoundingBox) is calculated on the GPU and the PVS is finally obtained.
On the CPU, when a non-leaf node is traversed and it intersects the ray, the traversal continues with its child nodes, which are tested against the ray in turn. When a leaf node is traversed and it intersects the ray, the 3D model it contains is recorded and added to the PVS. Finally, the intersection testing for the PVS is performed on the GPU according to the slab algorithm.
After the calculation on the GPU, a ray from the viewpoint may intersect many objects, but because of the occlusion relationships among them, only the first object intersected by the ray can be observed. Therefore, the nodes obtained by the intersection test are filtered further, and the object first intersected by each ray is added to the PVS. For the BVHA, Dis (Viewpoint → BoundingBox) is calculated on the GPU, and the nodes are grouped by ray and compared with each other. Finally, the PVS is obtained and used for rendering. The pseudocode of the BVHA steps is shown in Table 2.

Strategy-Adjustment-Based Algorithm
In SAA, the computing pipeline is created after the BVH is constructed. The nodes of the BVH are passed into the computing pipeline as an array, named Arr a , in depth-first traversal order. An array named Arr b with the same length as Arr a is also passed in; each element of Arr b gives the number of sub-nodes of the corresponding element of Arr a . During the subsequent intersection testing, if a node does not pass the test, the testing of its sub-nodes is canceled; if it passes, the depth-first traversal continues. 256 computing units are adopted to perform the parallel computing, and each unit performs the depth-first traversal of the BVH tree.
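The two-array layout described above lends itself to a stackless traversal. The Python sketch below is a serial illustration for a single ray and is not the authors' GPU code. The names `flatten` and `traverse` are hypothetical, the tree is represented as nested `(node, children)` tuples purely for the example, and each entry of Arr b is assumed to count all descendants of the node (its whole subtree), which is what allows a failed node to be skipped with a single index jump.

```python
from typing import Callable, List, Optional, Tuple


def flatten(
    tree: tuple,
    arr_a: Optional[list] = None,
    arr_b: Optional[List[int]] = None,
) -> Tuple[list, List[int]]:
    """Store the nodes in depth-first (pre-order) order; arr_b[i] = descendants of arr_a[i]."""
    if arr_a is None or arr_b is None:
        arr_a, arr_b = [], []
    node, children = tree
    index = len(arr_a)
    arr_a.append(node)
    arr_b.append(0)
    for child in children:
        flatten(child, arr_a, arr_b)
    arr_b[index] = len(arr_a) - index - 1
    return arr_a, arr_b


def traverse(arr_a: list, arr_b: List[int], hits: Callable[[object], bool]) -> List[int]:
    """Return the indices of the leaf nodes that pass the test, skipping failed subtrees."""
    reached: List[int] = []
    i = 0
    while i < len(arr_a):
        if hits(arr_a[i]):
            if arr_b[i] == 0:      # no sub-nodes: this is a leaf, record it
                reached.append(i)
            i += 1                 # next node in depth-first order
        else:
            i += arr_b[i] + 1      # the test failed: jump over the whole subtree
    return reached


# Toy 1D example: nodes are intervals (lo, hi); a "ray" is a point x.
toy_tree = ((0, 10), [
    ((0, 4), [((0, 1), []), ((2, 4), [])]),
    ((6, 10), [((6, 7), []), ((8, 10), [])]),
])
arr_a, arr_b = flatten(toy_tree)
leaves_for_3 = traverse(arr_a, arr_b, lambda n: n[0] <= 3 <= n[1])
```

On the GPU, each computing unit would run such a loop for its own ray, with the slab test in place of the toy interval test, and would then compute the distance for every recorded leaf.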
After the AABBs and the BVH are constructed, the subsequent traversal, calculation, and rendering are performed on the GPU. Because data cannot be transmitted to the GPU as a tree structure, the BVH is traversed in depth-first order and the result is stored in a list, on which the intersection testing is performed on the GPU, as shown in Fig. 10. If a node passes the test, the algorithm checks whether it has sub-nodes. If it has none, Dis (Viewpoint → BoundingBox) is calculated and recorded; if it has, the traversal continues. By contrast, if the node does not pass, the traversal of its sub-nodes is canceled. After the parallel calculations, the nodes with the minimum distance are added to the PVS and rendered. The pseudocode of the SAA steps is shown in Table 3.

Experimental Results
In this paper, five 3D geological models of different scales are selected to verify the rendering efficiency. The original models are visualized in Fig. 11, and the corresponding statistical information and experimental environment are shown in Tables 4 and 5 respectively.
Table 6 lists the rendering times of the five models under the different algorithms. Because of the large scale of models 4 and 5, both cause very long computation times in TOCA and BOCA, so they do not participate in the rendering time comparison. As can be seen, the rendering efficiency of IWCPA is higher than that of TOCA for all models. Compared with BOCA, the rendering efficiency of IWCPA improves by 54.1%, 87.2%, 89%, 97.3%, and 90.5% respectively. The BVHA rendering time is relatively longer for the five models; compared with BOCA, its efficiency improves by 48.5%, 4.4%, −26.2%, 21.8%, and −0.5% respectively (a negative value means an efficiency reduction). SAA has the fastest rendering speed for models 1, 2, and 3, but is slower than IWCPA for the other models; compared with BOCA, its efficiency improves by 88.74%, 94.04%, 93.49%, 96.64%, and 88.65% respectively. As a result, SAA has the highest rendering efficiency for large-volume 3D models, and the rendering time of IWCPA is not affected by the model's volume.
In addition, the coordinate (40000, 0, 0) was selected as the viewpoint V. The model's center was placed at (0, 0, 0), and 10,000 rays were launched uniformly from V into the visible region according to the aspect ratio of the screen. Fig. 12 compares the occlusion-culled result with the completely rendered one, as seen from V and from a top-down (overlook) view, respectively. As shown, occlusion culling reduces the number of rendered objects while producing a result similar to complete rendering. Meanwhile, the rendering results of the three proposed methods are the same, because the result depends mainly on the ray distribution and the bounding boxes, and the three methods use the same distribution of rays and the same configuration of bounding boxes.
Taking model 4 as an example, the rendering times from V are compared and visualized in Fig. 13, in which the time consumption is the same for the three proposed methods. The rendering time for the red regions is longer, with more objects in the PVS, while that for the green regions is shorter, with fewer objects in the PVS. The geological and detailed texture information of the original models is rendered effectively. This indicates that the occlusion culling algorithms focus on rendering the PVS, which avoids the time spent on rendering invisible parts and improves overall rendering efficiency with a relatively low loss of accuracy.

Conclusion
To address the long rendering time of large-scale 3D geological models, this paper proposes the IWCPA, BVHA, and SAA. For IWCPA, all the bounding boxes and rays are passed to the GPU for parallel computing. The scene management tree is introduced in BVHA, in which the BVH is built and used for intersection testing on the CPU, after which Dis (Viewpoint → BoundingBox) is calculated. For SAA, after the BVH is constructed on the CPU, the intersection testing and the Dis (Viewpoint → BoundingBox) calculation are performed on the GPU. Five 3D geological models of different scales were selected for the experiments. The results show that the SAA has the fastest rendering speed for large-volume models, and rendering time for BVHA will not be impacted by the model's size.
These algorithms still have drawbacks. For example, some important information about the 3D geological models might be excluded, and the parameter settings for the WebGPU pipeline are complicated. However, the initial position of the viewpoint can be user-defined, which means that users can focus on geological details by moving the viewpoint closer to the information they are interested in. In this way, the details near the viewpoint are rendered, while those far from the viewpoint may be excluded. In the future, scene management tree optimization and machine learning will be introduced to further improve the efficiency of the algorithms.
Fan, Gang Liu, Genshen Chen. All authors reviewed the results and approved the final version of the manuscript.

Figure 1: The schematic diagram of the occlusion culling algorithm

Figure 4: The bandwidth speed with different memory

Figure 5: The thought flow chart of three methods

Figure 6: Geological models from different perspectives and their corresponding bounding boxes in southwestern Guizhou

Figure 7: The build of the BVH Tree

Figure 9: The schematic diagram of the concurrent calculation of IWCPA

Figure 10: The schematic diagram of the concurrent calculation of SAA

Figure 12: (a) Effect comparison figure from viewpoint; (b) Effect comparison figure of elimination effect

Figure 13: The ratio picture of rendering time consumption

Table 1: Pseudocode of IWCPA
Algorithm IWCPA
Input: the information of meshes; the data of 100 * 100 rays created by the viewpoint, raytarget
Output: a box array consisting of the boxes that pass the intersection test, ans
Create an AABB bounding box for each mesh
box = AABB box array
// set the number of GPU parallel computing units
@workgroup_size(16, 16)
// do intersection tests in the GPU and output in ans array
ans = rayout(box, raytarget)
return ans;
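The 16 × 16 layout in Table 1 implies a simple dispatch calculation: the grid needs enough workgroups to cover every (ray, bounding box) pair, and threads that fall outside the data must be guarded. The Python sketch below emulates this serially for illustration; it is not WGSL and not the authors' implementation. The names `workgroup_counts`, `run_grid`, and `nearest_boxes` are hypothetical, the `kernel` callback stands in for the per-pair slab test, and the abscissa is assumed to index bounding boxes while the ordinate indexes rays, as in Fig. 9.

```python
import math
from typing import Callable, List, Set, Tuple

TILE = 16  # mirrors @workgroup_size(16, 16)


def workgroup_counts(n_rays: int, n_boxes: int, tile: int = TILE) -> Tuple[int, int]:
    """Number of workgroups along x (boxes) and y (rays) needed to cover all pairs."""
    return math.ceil(n_boxes / tile), math.ceil(n_rays / tile)


def run_grid(
    n_rays: int,
    n_boxes: int,
    kernel: Callable[[int, int], float],
    tile: int = TILE,
) -> List[List[float]]:
    """Serial emulation of the 2D dispatch: out[ray][box] = kernel(ray, box)."""
    out = [[-1.0] * n_boxes for _ in range(n_rays)]
    groups_x, groups_y = workgroup_counts(n_rays, n_boxes, tile)
    for gy in range(groups_y):
        for gx in range(groups_x):
            for ly in range(tile):
                for lx in range(tile):
                    ray, box = gy * tile + ly, gx * tile + lx
                    # Bounds guard: edge workgroups contain threads with no data.
                    if ray < n_rays and box < n_boxes:
                        out[ray][box] = kernel(ray, box)
    return out


def nearest_boxes(out: List[List[float]]) -> Set[int]:
    """Discard the -1 entries and keep, for every ray, the index of the closest box."""
    pvs: Set[int] = set()
    for row in out:
        hits = [(t, j) for j, t in enumerate(row) if t >= 0]
        if hits:
            pvs.add(min(hits)[1])
    return pvs


# Toy distances for 3 rays x 3 boxes (-1 marks a miss).
table = [[-1, 5, 3], [2, -1, -1], [-1, -1, -1]]
grid_out = run_grid(3, 3, lambda r, b: table[r][b])
pvs_ids = nearest_boxes(grid_out)
```

For the 100 × 100 = 10,000 rays of Table 1 and, say, 300 bounding boxes (an arbitrary number chosen for the example), `workgroup_counts(10000, 300)` gives 19 workgroups along the box axis and 625 along the ray axis.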

Table 2: Pseudocode of BVHA
Algorithm BVHA
Input: the information of meshes; the data of 100 * 100 rays created by the viewpoint, raytarget
Output: a box array consisting of the boxes that pass the intersection test, ans
// do intersection tests in the GPU and output in ans array
ans = rayout(tmpbox, raytarget)
return ans;

Table 3: Pseudocode of SAA

Table 5: Hardware environment

Table 6: Time consumption statistics