# Pipeline for NORG identification and synteny analysis

## Genome resources

A total of 22 genomes covering AA, BB and FF genome species in the _Oryza_ genus were collected, including 16 high-quality genome assemblies covering 15 subpopulations of Asian cultivated rice (Kawahara et al. 2013; Zhou et al. 2020; Song et al. 2021) and six genome assemblies of other species (_O. glaberrima_, _O. barthii_, _O. glumaepatula_, _O. meridionalis_, _O. punctata_ and _O. brachyantha_) (Stein et al. 2018; Ntakirutimana et al. 2023; Thathapalli Prakash et al. 2023). The genome sequences were obtained in FASTA format for each assembly.

The genome sequences and gene annotations of _O. sativa_ Nipponbare were downloaded from Ensembl Plants (http://plants.ensembl.org/). The genome sequences and gene annotations of the other _O. sativa_ accessions were downloaded from RGI (https://riceome.hzau.edu.cn) (Yu et al. 2023). The assembly data of these accessions are deposited in NCBI GenBank (https://www.ncbi.nlm.nih.gov/datasets/genome/): CHAO MEO::IRGC 802731 (GCA_009831315.1), Azucena (GCA_009830595.1), KETAN NANGKA::IRGC 19961-2 (GCA_009831275.1), ARC 10497::IRGC 12485-1 (GCA_009831255.1), PR 106::IRGC 53418-1 (GCA_009831045.1), Minghui 63 (GCA_001623365.2), IR 64 (GCA_009914875.1), Zhenshan 97 (GCA_001623345.2), LIMA::IRGC 81487-1 (GCA_009829395.1), KHAO YAI GUANG::IRGC 65972-1 (GCA_009831295.1), GOBOL SAIL (BALAM)::IRGC 26624-2 (GCA_009831025.1), LIU XU::IRGC 109232-1 (GCA_009829375.1), LARHA MUGAD::IRGC 52339-1 (GCA_009831355.1), N22 (N 22::IRGC 19379-1) (GCA_001952365.2) and NATEL BORO::IRGC 34749-1 (GCA_009831335.1).

Genome sequences for the remaining assemblies, _O. glaberrima_::IRGC 96717 (GCA_000147395.3), _O. barthii_::IRGC 105608 (GCA_000182155.4), _O. glumaepatula_ (GCA_000576495.2), _O. meridionalis_::OR44 (W2112) (GCA_000338895.3), _O. punctata_::IRGC 105690 (GCA_000573905.2) and _O. brachyantha_::IRGC 101232 (GCA_000231095.3), were downloaded from NCBI GenBank (https://www.ncbi.nlm.nih.gov/datasets/genome/).

The plastid genome NC_001320 and the mitochondrial genome NC_011033 (both from NCBI GenBank) were used as the query sequences.

## NORG identification

Sequences were aligned with `NUCmer` (Marçais et al. 2018) (version 4.0.0beta2, parameters: `-c 50 --maxmatch`). All anchor matches, regardless of their uniqueness, were used (parameter: `--maxmatch`), and the minimum length of a cluster of matches was set to 50 (parameter: `-c 50`). These parameters admit shorter matches and retain more matches than the default settings. Overlapping hits were merged with the BEDTools (Quinlan and Hall 2010) `merge` command, and the merged intervals were used to calculate the total length of the NORGs. Alignments separated by < 5 kb of DNA of non-organellar origin were then merged with the BEDTools `merge` command, and the results were provisionally treated as NORG clusters.
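The two merging passes described above can be illustrated with a minimal Python sketch. It is not the code in `norg_identify_pipeline_opt.bash`; it only mimics the behaviour of BEDTools `merge` (with and without a distance allowance such as its `-d` option) on BED-like intervals. The function names, the example coordinates and the choice of `max_gap=4999` to express "< 5 kb" are assumptions made for illustration.

```python
from typing import List, Tuple

# (chromosome, start, end), 0-based half-open coordinates as in BED
Interval = Tuple[str, int, int]


def merge_intervals(hits: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge intervals on the same chromosome separated by at most max_gap bp.

    With max_gap=0, overlapping and book-ended intervals are merged.
    """
    merged: List[Interval] = []
    for chrom, start, end in sorted(hits):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def total_length(intervals: List[Interval]) -> int:
    """Sum of interval lengths; meaningful only after overlaps are merged."""
    return sum(end - start for _, start, end in intervals)


# Hypothetical organelle-derived hits on a nuclear genome
hits = [
    ("chr1", 100, 400),
    ("chr1", 350, 600),      # overlaps the previous hit
    ("chr1", 3000, 3200),    # 2,400 bp after the previous hit
    ("chr1", 20000, 20500),  # 16,800 bp after the previous hit
    ("chr2", 50, 150),
]

norgs = merge_intervals(hits)                   # pass 1: overlaps only
clusters = merge_intervals(hits, max_gap=4999)  # pass 2: gaps < 5 kb

print(total_length(norgs))
print(clusters)
```

In this toy input the first pass yields four non-overlapping NORG intervals used for the length total, and the second pass joins the first three `chr1` hits into a single provisional cluster.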

script: `norg_identify_pipeline_opt.bash`

requirement:
- BEDTools
- MUMmer 4
- Perl

## Synteny of NORGs

Each NORG cluster, together with its 300-bp flanking sequences, was aligned to all genomes using `NUCmer` with the default parameters, followed by `show-coords` (parameters: `-rclT -O`). First, the locations of the NORG clusters in the other genomes were determined from the `.coords` file generated by `show-coords`. If a genome contained the same NORG cluster without structural variation, the alignment was marked as `CONTAINS` in the `.coords` file. For the remaining NORG clusters, the best location of each NORG cluster in each genome was determined by `show-tiling` (parameters: `-g 20000`). NORG clusters at the same location were recognized as the same NORG using BEDTools `intersect`.
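As a rough illustration of two ideas in this paragraph, the Python sketch below extends a cluster by a flanking window and groups clusters that land on overlapping coordinates in a target genome. It is a simplified stand-in for the `NUCmer` and BEDTools `intersect` steps, not the logic of `norg_synteny_pipeline_opt.bash`. The cluster identifiers, the coordinates, and the assumptions that the flank is added on both sides and that any overlap counts as "the same location" are hypothetical.

```python
from typing import List, Optional, Tuple

# (cluster_id, chromosome, start, end) of a cluster placed on one target genome
Placement = Tuple[str, str, int, int]


def add_flanks(start: int, end: int, chrom_length: int, flank: int = 300) -> Tuple[int, int]:
    """Extend an interval by `flank` bp on both sides, clamped to the chromosome."""
    return max(0, start - flank), min(chrom_length, end + flank)


def group_by_location(placements: List[Placement]) -> List[List[str]]:
    """Group clusters whose placements overlap on the same chromosome."""
    groups: List[List[str]] = []
    current_chrom: Optional[str] = None
    current_end = -1
    ordered = sorted(placements, key=lambda p: (p[1], p[2], p[3]))
    for cluster_id, chrom, start, end in ordered:
        if chrom == current_chrom and start < current_end:
            groups[-1].append(cluster_id)        # overlaps the running group
            current_end = max(current_end, end)
        else:
            groups.append([cluster_id])          # start a new group
            current_chrom, current_end = chrom, end
    return groups


# Hypothetical placements of clusters from several assemblies on one genome
placements = [
    ("MH63_c12", "chr3", 1000, 1800),
    ("ZS97_c9", "chr3", 1100, 1900),
    ("IR64_c30", "chr3", 50000, 50400),
    ("N22_c7", "chr5", 1000, 1800),
]

print(add_flanks(120, 900, chrom_length=10000))
print(group_by_location(placements))
```

Here the two `chr3` clusters that overlap are reported as one group, while the distant `chr3` cluster and the `chr5` cluster stay separate.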

script: `norg_synteny_pipeline_opt.bash`

requirement:
- MUMmer 4
- Perl

Because some NORGs were split by sequence insertions longer than 5 kb, multiple NORG clusters at the same location were merged in the following steps.

In the first step, the alignments of NORGs with 300-bp flanking sequences from the previous step were used directly to merge partially dispersed NORGs. Dot plots of each NORG cluster produced by `mummerplot` were used as an aid to manual calibration.

In the second step, the 2000-bp flanking sequences of each NORG cluster from step one were aligned to the genomes in which that NORG had not been identified. The `.delta` file from the alignment was then processed with `show-tiling` (parameters: `-g 20000 -v 80`) to determine the position of the flanking sequences in each genome.

In the third step, several NORGs with multiple large insertions were corrected manually using `BLASTN` (Camacho et al. 2009) and `BLAT` (Kent 2002). To assess the accuracy of this workflow, synteny plots of each orthologous NORG group (ONG) were generated with `genoPlotR` (Guy et al. 2010) to visually validate the NORGs.



Example: norg-4

- Synteny: `example\norg-4-synteny`

- genoPlotR script: `example\norg-4-synteny\norg-4.r`


## Determination of ONG presence or absence based on WGS datasets

The WGS datasets were obtained from the Sequence Read Archive (SRA) database and comprise the data of 3K-RG (Wang et al. 2018b), Cubry et al. (2018), Meyer et al. (2016) and Choi et al. (2019), under accession numbers PRJEB6180, PRJEB21312, PRJNA315063 and PRJNA453903, respectively.

To identify the presence/absence variations (PAVs) of ONGs in the rice WGS dataset, we developed a pipeline based on the presence/absence genotypes across assemblies, consisting of four major steps. (1) Sequences of the presence/absence genotypes of ONGs were prepared using the identification results described above. (2) To reduce mapping errors caused by repetitive sequences and organelle genome sequences, the rice TE library “rice6.9.5.liban” from EDTA (Ou et al. 2019) and the rice organellar genomes NC_001320 and NC_011033 (NCBI GenBank) were added to the sequences from the previous step. (3) Paired-end reads were mapped to the merged sequences with BWA-MEM2 (Vasimuddin et al. 2019). (4) The presence/absence genotype was verified from the mapping results.
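The text does not state the decision rule used in step (4). Purely as an illustration of how mapping results could be turned into a genotype, the Python sketch below applies a breadth-of-coverage criterion to per-base read depths. Everything in it is an assumption: the idea that each ONG is represented by one "presence" and one "absence" sequence, the function names, the `min_breadth=0.9` threshold and the example depths. It should not be read as the authors' method.

```python
from typing import Sequence


def breadth_of_coverage(depths: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of positions covered by at least min_depth reads."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


def call_genotype(
    presence_depths: Sequence[int],
    absence_depths: Sequence[int],
    min_breadth: float = 0.9,  # hypothetical threshold, not from the source
) -> str:
    """Return 'present', 'absent' or 'ambiguous' for one ONG in one accession."""
    presence_ok = breadth_of_coverage(presence_depths) >= min_breadth
    absence_ok = breadth_of_coverage(absence_depths) >= min_breadth
    if presence_ok and not absence_ok:
        return "present"
    if absence_ok and not presence_ok:
        return "absent"
    return "ambiguous"  # both or neither sequence is well covered


# Hypothetical per-base depths over the two sequences of one ONG
presence_depths = [12] * 95 + [0] * 5   # 95% of positions covered
absence_depths = [3] * 40 + [0] * 60    # 40% of positions covered

print(call_genotype(presence_depths, absence_depths))
```

Reporting an explicit `ambiguous` class keeps calls that both sequences support, or that neither supports, out of the downstream presence/absence matrix rather than forcing a genotype.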

To verify that this pipeline could accurately identify ONG presence/absence, we chose five WGS accessions (CX140 | IRGSP1.0, CX145 | MH63RS3, CX133 | ZS97RS3, CX145 | AzucenaRS1, IRIS_313-8813 | Os117425RS1) from 3K-RG corresponding to the genome assemblies used in this study. After filtering out the misidentified ONGs, 754 polymorphic NORGs were retained.

The MDS plots were constructed from the presence/absence information of the 754 ONGs for each accession. The distance matrices were calculated using the `bray` method of the `vegdist` function from the R package `vegan`, and the MDS was subsequently computed using the `cmdscale` function in R (version 4.2.3) (R Core Team 2023).
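For readers unfamiliar with the distance used here, the following Python sketch computes the Bray-Curtis dissimilarity on a toy presence/absence matrix. It only illustrates the formula; the analysis itself uses `vegdist` and `cmdscale` in R, and the accession names and values below are made up. On 0/1 data the measure is equivalent to the Sørensen dissimilarity. Returning 0.0 for two all-zero rows is a convention of this sketch only.

```python
from typing import Dict, List, Sequence


def bray_curtis(x: Sequence[int], y: Sequence[int]) -> float:
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    # Two all-zero rows have an undefined distance; this sketch returns 0.0.
    return numerator / denominator if denominator else 0.0


def distance_matrix(rows: Dict[str, List[int]]) -> List[List[float]]:
    """Pairwise dissimilarities between accessions, in insertion order."""
    names = list(rows)
    return [[bray_curtis(rows[a], rows[b]) for b in names] for a in names]


# Hypothetical presence (1) / absence (0) calls for four ONGs
pav = {
    "acc1": [1, 1, 0, 1],
    "acc2": [1, 0, 0, 1],
    "acc3": [0, 0, 1, 0],
}

dist = distance_matrix(pav)
for name, row in zip(pav, dist):
    print(name, [round(value, 3) for value in row])
```

A matrix like `dist` is what a classical MDS routine such as `cmdscale` takes as input to place accessions in two dimensions.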

The presence frequency of ONGs was calculated in R (version 4.2.3). The heatmap was plotted with TBtools (Chen et al. 2020).


### MDS plot for Asian and African populations

See `MDS for supp_code.Rmd` and `MDS for supp_code.html`
