The whole-genome survey of Acer griseum, its polymorphic simple sequence repeats development and application

Background: Acer griseum Pax is an endangered species endemic to China with both ornamental and economic value. However, the lack of information on its genome size and characteristics hinders further work at the genome level. Methods: This paper applied bioinformatics methods to predict the characteristics and patterns of the A. griseum genome, providing an important basis for formulating its whole-genome sequencing scheme. This study also characterized the simple sequence repeats (SSRs) of A. griseum, laying the foundation for the development and application of genome-wide SSR markers. PE150 sequencing was performed on the BGI MGISEQ platform, and the sequence files were analyzed by the K-mer method with GCE software to estimate the genome characteristics. Results: The genome size was estimated to be 739.63 Mb, with a heterozygosity ratio of 1.33% and a repeat ratio of 65.68%. A total of 825,960 SSR loci were identified in the assembled genome sequence, and primers were successfully designed for 526,020 loci. To verify the effectiveness of these primers, 100 primer pairs were randomly selected and synthesized, and 81 pairs successfully amplified the target fragments. Fourteen primer pairs with good polymorphism were selected for principal coordinates analysis (PCoA) of 31 A. griseum individuals from two populations, and they showed favorable heterozygosity and PIC values. According to the findings, these SSRs might identify genetic variation associated with geographic areas. Conclusion: An Illumina + PacBio assembly strategy is suggested for whole-genome sequencing because of the high heterozygosity and high repeat content of the genome. In addition, the SSR primers designed in batches in this study lay a foundation for in-depth study of the population structure and population maintenance mechanisms of A. griseum, which will help the effective conservation and sustainable utilization of this germplasm resource.


Introduction
Acer is a genus of woody plants that combines ornamental value with economic use and is placed in Sapindaceae in the latest APG IV system (Angiosperm Phylogeny Group et al., 2016). As a large genus containing more than 160 species, Acer still lacks a settled infrageneric classification, and revising it requires multi-level studies drawing on fossil evidence, anatomy, morphology, and molecular biology (Grimm et al., 2006). Paperbark maple (Acer griseum Pax) is an endangered species endemic to China (Fang, 1981; Aiello and Crowley, 2019). It is mainly distributed in sparse forests at an altitude of 1500-2000 m in Southwestern Henan, Southern Shaanxi, Southeastern Gansu, Western Hubei, and Eastern Sichuan. Paperbark maple has high value and many uses. Its hardwood can be used to make a variety of valuable implements, and the bark is rich in fiber suitable for making rope and paper. In addition to its economic value, paperbark maple is a beautiful tree with high ornamental value and is a relatively rare landscaping species in gardens (Fang, 1981; Fu, 2020).
So far, research on A. griseum has focused on germplasm resource investigation, asexual propagation, seed propagation, new variety cultivation, and chemical composition (Maynard and Bassuk, 1990; Chen et al., 2013; Fu, 2020). At the molecular level, Sun (2014) developed 27 polymorphic simple sequence repeat (SSR) markers by constructing a magnetic bead enrichment library, used 11 primer pairs to study the genetic structure and genetic diversity of A. griseum populations, and offered a preliminary discussion of the possible reasons for the decline of these populations. Researchers have also sequenced the chloroplast genome of A. griseum and studied its populations using cpDNA primers and chloroplast genome comparisons (Wang, 2015; Wang et al., 2017; Ye et al., 2017; Fu, 2020). However, the lack of information on the genome size and characteristics of A. griseum has hindered further work at the genomic level. A whole-genome sequence of A. griseum is therefore needed; it will help to reveal the species' phylogeny and resistance mechanisms at the molecular level and provide scientific support for its genetic conservation and the rational use of its economic value.
Genome survey sequencing based on NGS (next-generation sequencing) technology can be used to cost-effectively assess genomic information such as heterozygosity levels, genome size, and repetitive sequence content, and can be used to develop molecular markers on a large scale. In this study, we aimed to predict the genomic characteristics of A. griseum by NGS technology and then identify SSRs from the genome survey sequencing for microsatellite marker development. This study will provide a framework for future whole-genome sequencing and will be useful for subsequent population genetics and molecular species identification of A. griseum.

Materials and Methods
Experimental materials and DNA extraction
Plant samples were collected from two wild populations of A. griseum in Longyuwan National Forest Park (LY) (33°41′42″N, 111°47′48″E; n = 17) and Duhuigou Ecotourism Area (DH) (34°6′30″N, 112°26′51″E; n = 14) in Henan Province, China. The voucher specimen of A. griseum was deposited in the Herbarium of Luoyang Normal University (deposition number: BOT21063). Healthy leaves from adult A. griseum trees were collected, placed in sealed bags, and dried and preserved with silica-gel desiccant. DNA was extracted from the dried leaves by the CTAB method (Clarke, 2009).
Sequencing data and quality control
The qualified DNA samples were randomly sheared into 300-500 bp fragments with a Covaris ultrasonicator, and the library was prepared by end repair, A-tail addition, sequencing adaptor ligation, purification, and polymerase chain reaction amplification, among other steps. The constructed library was sequenced in PE150 mode on the BGI MGISEQ platform. Raw image data files obtained by high-throughput sequencing were processed by base calling and converted into raw reads in FASTQ format.
To improve the accuracy of the data, we used SOAPnuke software to filter all the raw reads and obtain clean reads (Chen et al., 2018). The main parameters were -lowQual = 20, -nRate = 0.005, and -qualRate = 0.5; other parameters were left at their defaults. The data were processed as follows: (1) elimination of duplicated reads caused by PCR amplification and other related reasons, (2) removal of paired reads containing adapter sequences, (3) removal of paired reads with an N ratio exceeding 0.5%, and (4) removal of low-quality paired reads.
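The N-content and low-quality rules listed above can be illustrated with a minimal Python sketch. It is not SOAPnuke itself: the function names are hypothetical, the thresholds are taken from the parameters quoted in the text, and the treatment of a base as "low quality" when its Phred score is at or below 20 is an assumption made for illustration. Duplicate and adapter removal are not covered.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding assumed) into scores."""
    return [ord(c) - 33 for c in quality_string]


def passes_filter(seq, quals, low_qual=20, n_rate=0.005, qual_rate=0.5):
    """Return True if one read passes the N-content and low-quality checks.

    Assumption: a base counts as low quality when its score is <= low_qual.
    """
    if not seq:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    low_fraction = sum(q <= low_qual for q in quals) / len(quals)
    return n_fraction <= n_rate and low_fraction <= qual_rate


def keep_pair(read1, read2, **thresholds):
    """Keep a read pair only when both mates pass; each read is (seq, quals)."""
    return passes_filter(*read1, **thresholds) and passes_filter(*read2, **thresholds)
```

With a 150 bp read, an N ratio threshold of 0.5% means that a single N base is already enough to discard the pair, since 1/150 is about 0.67%.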
From the filtered high-quality data, 10,000 read pairs were randomly selected and compared against the NCBI nucleotide database (NT) with the Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990) to evaluate possible contamination of the samples.
K-mer analysis
K-mer analysis of the sequence files was performed with the GCE software (Liu et al., 2013) to estimate genome size, heterozygosity, and repetitive sequence content. In this study, K = 17 was selected for the analysis to ensure that a sufficient number of K-mers were generated to cover the entire genome.
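As an illustration of what a K-mer depth distribution is, the following Python sketch counts K-mers in a set of reads and tabulates how many distinct K-mers occur at each depth. It is a toy version for small inputs, not the counting procedure used by GCE; the function names are hypothetical, and the sketch counts forward-strand K-mers only, whereas production tools typically merge each K-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every K-mer of length k in the reads, skipping K-mers that contain N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def depth_histogram(counts):
    """Map each depth (number of occurrences) to the number of distinct K-mers at that depth."""
    return Counter(counts.values())
```

Plotting the histogram for real survey data gives a curve of the kind described for Fig. 1, and the sum of all counts is the total number of K-mers used in the genome size estimate.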
Simple sequence repeats analysis and validation
SSR loci were searched for in the assembled genomic sequences using MISA 1.0 (Beier et al., 2017) with parameters set to 1-10, 2-6, 3-5, 4-5, 5-5, and 6-5 (e.g., 1-10 means that at least 10 repeats are required for detection when the repeat unit is a mono-nucleotide). In addition, the minimum distance between two SSRs was set to 100 bp: two SSRs separated by 100 bp or more were treated as distinct loci, and otherwise they were treated as one SSR marker. The SSRs obtained in A. griseum were analyzed in three ways: the microsatellite composition of the genome, the distribution of SSRs, and the dominant repeat motif types.
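The search thresholds described above can be sketched in Python with regular expressions. This is a simplified illustration and not the MISA implementation: the function names are hypothetical, only perfect repeats of A, C, G, and T are detected, and SSRs closer than 100 bp are simply grouped into one record.

```python
import re

# Minimum repeat number for each repeat-unit length, as listed in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        pos = 0
        while True:
            match = pattern.search(seq, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_primitive(motif):
                repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), match.end(), motif, repeats))
                pos = match.end()
            else:
                # A run such as AAAA... belongs to the shorter unit; move on.
                pos = match.start() + 1
    return sorted(hits)


def merge_compound(hits, max_gap=100):
    """Group SSRs whose gap to the previous SSR is smaller than max_gap bp."""
    merged = []
    for start, end, motif, repeats in sorted(hits):
        if merged and start - merged[-1][1] < max_gap:
            prev_start, prev_end, parts = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), parts + [(motif, repeats)])
        else:
            merged.append((start, end, [(motif, repeats)]))
    return merged
```

The primitive-motif check keeps a long mono-nucleotide run from also being reported as a di- or tri-nucleotide SSR.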
Primers were designed in the flanking regions of the SSR loci using Primer Premier 3.0 software (Untergasser et al., 2012). To verify the validity of these primers, a total of 100 primer pairs were synthesized and tested by PCR amplification in 20 individuals of A. griseum (ten random samples from each of the LY and DH populations). PCR was performed in a 15 μL volume containing 7.5 μL 2 × PCR mix (Tiangen, Beijing, China), 20 ng genomic DNA, and 0.25 μM each of the forward and reverse primers under the following conditions: initial denaturation at 94°C for 5 min, followed by 30 cycles of denaturation at 94°C for 50 s, annealing for 45 s, and extension at 72°C for 30 s. The amplified products were separated by 10% polyacrylamide gel electrophoresis and visualized by silver staining.
Principal coordinates analysis (PCoA)
Fourteen pairs of SSR primers with good polymorphism were selected for genetic analysis of the 31 individuals from the LY and DH populations. The amplification data were entered into GenAlEx V6.5, and PCoA was performed on the genetic distances between individuals (Smouse and Peakall, 2012).

Evaluation of sequencing quality
The raw and filtered clean data obtained by sequencing are shown in Table 1. High-quality sequencing data were submitted to the NCBI (accession number: PRJNA881718). The BLAST results showed that the five species with the most matches to the A. griseum reads were A. pentaphyllum (1.76%), Xanthoceras sorbifolium (0.745%), A. yangbiense (0.695%), Pistacia vera (0.545%), and A. triflorum (0.515%). These results indicate that the data are reliable and accurate, free of exogenous contamination, and suitable for further research. Q20 (%) and Q30 (%) refer to the percentage of bases with Phred values greater than 20 and 30, respectively, among all bases.

K-mer estimates genomic information
The genomic characteristics of A. griseum were analyzed by K-mer analysis. The value of K was set to 17, and the total number of K-mers was 64,956,696,712. In the K-mer depth distribution shown in Fig. 1, the first peak, located at 41×, is the heterozygous peak. The dominant peak is located at 82×, twice the depth of the first peak, which matches the standard peak shape for diploids.
The genome size was calculated as the total number of K-mers divided by the K-mer depth, with the K-mer depth derived from the K-mer distribution curve using GCE software. The estimated genome size was 739.63 Mb, with a heterozygosity ratio of 1.33% and a repetitive sequence ratio of 65.68%. SOAPdenovo2 (Luo et al., 2012) was used for the preliminary assembly of the sequencing data, and the results are shown in Table 2.
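The basic relation between K-mer number, K-mer depth, and genome size can be written as a short Python sketch. The function name is hypothetical, and plugging in the raw position of the dominant peak (82×) is an assumption made only for illustration: doing so gives roughly 792 Mb rather than the reported 739.63 Mb, presumably because GCE derives the depth from a fitted model of the distribution instead of reading the raw peak position.

```python
def naive_genome_size(total_kmers, kmer_depth):
    """Genome size in bp as total K-mer number divided by K-mer depth."""
    if kmer_depth <= 0:
        raise ValueError("kmer_depth must be positive")
    return total_kmers / kmer_depth


TOTAL_KMERS = 64_956_696_712   # total number of 17-mers reported in the text
MAIN_PEAK_DEPTH = 82           # dominant peak of the depth distribution (Fig. 1)

size_mb = naive_genome_size(TOTAL_KMERS, MAIN_PEAK_DEPTH) / 1e6
print(f"naive estimate: {size_mb:.2f} Mb")
```

The sketch is useful mainly as a sanity check that a reported genome size is of the same order as the raw peak-based value.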

Genomic simple sequence repeats composition analysis
In this study, a total of 825,960 SSR loci were detected in the 1005.2 Mb genome sequence of A. griseum, corresponding to one SSR locus per 1217.1 bp on average.
Among the SSR types, mono-nucleotide repeats were the most abundant, with 509,385 loci (61.67% of the total), followed by di-nucleotide (147,979; 17.92%), tri-nucleotide (37,236; 4.51%), tetra-nucleotide (10,173; 1.23%), penta-nucleotide (3,270; 0.40%), and hexa-nucleotide (3,136; 0.38%) repeats. There were 114,780 compound SSRs, accounting for 13.90% of the total. The distribution of the different types of SSR motifs in A. griseum is shown in Fig. 2, and the major motifs are listed in Table 3.
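The shares and the average spacing reported above follow from simple arithmetic, sketched here in Python. The counts, the total, and the sequence length are those given in the text; the variable and function names are hypothetical. Because the sequence length is quoted to one decimal place in Mb, the computed spacing is approximately 1217 bp and may differ from the reported value in the last digit.

```python
SSR_COUNTS = {
    "mono": 509_385,
    "di": 147_979,
    "tri": 37_236,
    "tetra": 10_173,
    "penta": 3_270,
    "hexa": 3_136,
    "compound": 114_780,
}
TOTAL_SSRS = 825_960      # total SSR loci reported in the text
SEQUENCE_BP = 1005.2e6    # length of the searched genome sequence in bp


def percent_shares(counts, total):
    """Percentage of the total contributed by each SSR type, to two decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


def mean_spacing_bp(sequence_bp, n_loci):
    """Average number of base pairs per SSR locus."""
    return sequence_bp / n_loci


shares = percent_shares(SSR_COUNTS, TOTAL_SSRS)
spacing = mean_spacing_bp(SEQUENCE_BP, TOTAL_SSRS)
```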

Simple sequence repeats validation and principal coordinates analysis
Primers were successfully designed for 526,020 of the 825,960 SSR loci. A total of 100 primer pairs were randomly chosen for validation in 20 DNA samples of A. griseum, and 81 of them amplified the target band (Suppl. Table S1). To further test the effectiveness of these primers, 14 primer pairs with good polymorphism and stable detection results were selected for PCoA of 31 A. griseum individuals from two populations. SSR-PCR amplification profiles are shown in Fig. 3. The 17 individuals from the LY population clustered together, as did the 14 individuals from the DH population (Fig. 4). The first two coordinates explained 22.83% and 16.34% of the overall genetic variation, respectively. These results indicate that SSR markers developed from the genome survey can identify genetic variation among populations of A. griseum from different geographical locations. Details of the 14 primer pairs are shown in Table 4.

Discussion
Before conducting whole-genome sequencing in plants, it is important to assess the genome size and complexity in order to develop a sequencing protocol. At present, the methods for genome size determination include flow cytometry and genome survey analysis (Zhou et al., 2018). When plant genome size is determined by flow cytometry, results for the same species can differ because of differences in operation and testing conditions (e.g., lysis methods and internal standard selection) (Doležel et al., 2007; Lin et al., 2019). Genome surveys based on next-generation high-throughput sequencing technology have the advantage of yielding a large number of gene sequences while determining genomic features, and they are increasingly valued and applied by researchers (Kirkness et al., 2003; Yang et al., 2022).
It is commonly believed that the greater the heterozygosity and the more repetitive sequences in a species' genome, the more difficult it is to assemble (Xu et al., 2020). Assembly is considered difficult if the heterozygosity is higher than 0.5% and more difficult still if it is higher than 1%. In this study, the heterozygosity rate of A. griseum predicted by the genome survey was 1.33%, and the percentage of repetitive sequences was 65.68%. With such high heterozygosity and repeat content, de novo assembly of the genome is expected to be difficult, and an Illumina + PacBio sequencing and assembly strategy is recommended for follow-up whole-genome sequencing.
In this study, the different SSR types and the dominant motifs of A. griseum were analyzed. A comparison of the SSRs of A. griseum and its relatives showed that mono-nucleotide SSRs were the most numerous type in A. griseum, A. miaotaiense, and A. rubrum, followed by di-nucleotide SSRs. The di-nucleotide SSRs of A. griseum were dominated by the AT/TA motif, the same as in A. truncatum and A. rubrum and different from A. miaotaiense and A. davidii, in which AG/CT was the main motif. Among the tri-nucleotide SSRs, the main motif of A. griseum, A. truncatum, and A. miaotaiense was AAT/ATT, while that of A. davidii and A. rubrum was GAA/TTC (Wang et al., 2019; Guo et al., 2021; Mu et al., 2021). The differences in SSR types and main motifs among these plants may be related to the biological characteristics of the species or to different sequencing platforms, search criteria, and other factors. An expected heterozygosity of 0.3-0.8 obtained using microsatellites is considered to indicate high genetic diversity (Takezaki and Nei, 1996; Zhou et al., 2022). The average observed heterozygosity (Ho = 0.37) and expected heterozygosity (He = 0.5) of the 14 polymorphic SSR loci in this study were consistent with this criterion for high genetic diversity. When Botstein et al. (1980) proposed the polymorphic information content (PIC) index for measuring the degree of gene variation, they considered a locus to be highly polymorphic when PIC > 0.5, moderately polymorphic when 0.25 < PIC < 0.5, and less polymorphic when PIC < 0.25. In this study, 13 of the 14 loci (all except AG12) were moderately or highly polymorphic. PCoA is a powerful tool for assessing population genetic variation. The results indicate that the polymorphic SSR markers proposed in this study can be effectively used for the genetic analysis of A. griseum, laying the foundation for genetic variation research and effective conservation of this species.
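For readers unfamiliar with the two indices discussed above, the following Python sketch computes expected heterozygosity and PIC for one locus from its allele frequencies. It uses the standard textbook formulas, He = 1 - Σp_i² and PIC = 1 - Σp_i² - ΣΣ2p_i²p_j² (summed over allele pairs i < j), without any sample-size correction, so values may differ slightly from those reported by population genetics software. The function names and the example frequencies are hypothetical and are not data from this study.

```python
from itertools import combinations


def expected_heterozygosity(freqs):
    """He = 1 - sum of squared allele frequencies (no sample-size correction)."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphic information content for one locus."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


def classify_pic(value):
    """Apply the thresholds quoted in the text: > 0.5 high, > 0.25 moderate, else low."""
    if value > 0.5:
        return "high"
    if value > 0.25:
        return "moderate"
    return "low"


# Hypothetical locus with two equally frequent alleles.
example = [0.5, 0.5]
print(expected_heterozygosity(example), pic(example), classify_pic(pic(example)))
```

A locus with two equally frequent alleles therefore falls in the moderately polymorphic class, and more alleles at balanced frequencies push PIC above 0.5.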
Funding Statement: This work was supported by the National Natural Science Foundation of China [Grant No. 31870697].