Here sampled and sequenced a total of 104 individuals from a cross between a queen and a male of the ant Cardiocondyla obscurior. This included two grandparents, two F1 parents and 100 F2 individuals (88 workers and 12 males). Details about the cross, our questions as well as our main findings can be found here.
The genome assembly and annotation of C. obscurior as well as the raw reads can be downloaded from NCBI under BioProject: PRJNA934066
ls *_R1_001.fastq.gz|sed 's/_R1_001.fastq.gz//g'|nice parallel --jobs 15 --eta 'java -jar /global/projects/programs/bin/trimmomatic-0.38.jar PE {}_R1_001.fastq.gz {}_R2_001.fastq.gz paired/{}_paired_1.fastq.gz unpaired/{}_unpaired_1.fastq.gz paired/{}_paired_2.fastq.gz unpaired/{}_unpaired_2.fastq.gz SLIDINGWINDOW:4:20 MINLEN:40'
for infile in *_paired_1.fastq.gz; do base=$(basename -s _paired_1.fastq.gz ${infile}); bwa mem -t 25 ~/genome/Cobs3.1.clean.fa ${infile} ${base}_paired_2.fastq.gz > samfiles/${base}.sam;done
# a- clean sam files ls *sam|parallel --progress --eta --jobs 5 'java -jar /global/projects/programs/bin/picard.jar CleanSam I="{}" O="{=s/(.*)\..*/$1/=}C.sam"' # b- covert to bam and sort bam file (sort by default assume sorting by coordinates): ls *C.sam| nice parallel --progress --eta --jobs 5 'samtools view -S -u -b {} | samtools sort -o {.}S.bam' # c- fix mate information if needed ls *CS.bam| nice parallel --progress --eta --jobs 5 'java -jar /global/projects/programs/bin/picard.jar FixMateInformation I="{}" O="{.}F.bam" ADD_MATE_CIGAR=true ASSUME_SORTED=true' # d- fix read group information nice parallel -a rgid.txt --colsep '\t' --jobs 20 --eta java -jar /global/projects/programs/bin/picard.jar AddOrReplaceReadGroups I={1} O={5}R.bam RGID={2} RGLB={4} RGPL=illumina RGPU={3} RGSM={4} # assumes you have a tab separated rgid file with following columns: <bamfile name> <readgroup id> <lane if any> <sample name (unique)> <basename for output> # e- locates and tags duplicate reads in bam file ls *R.bam |nice parallel --progress --eta --jobs 5 'java -jar /global/projects/programs/bin/picard.jar MarkDuplicates I="{}" O="{.}D.bam" M="{.}D.txt" MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 ASSUME_SORTED=true' # f- index bam file samtools index alignment.CSFRD.bam
# a- generate gVCFs ls *D.bam |parallel --progress --eta --jobs 10 '/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /global/projects/programs/source/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar HaplotypeCaller -R ~/genome/Cobs3.1.clean.fa -I {} -O {.}.g.vcf.gz -ERC GVCF' # b- combine gVCFs /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /global/projects/programs/source/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar CombineGVCFs -R ~/genome/Cobs3.1.clean.fa -V input.list -O Cobs3.1.linkageMapping.g.vcf.gz # c- genotype the combined gVCF /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /global/projects/programs/source/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar GenotypeGVCFs -R ~/genome/Cobs3.1.clean.fa -V Cobs3.1.linkageMapping.g.vcf.gz -O Cobs3.1.linkageMapping.vcf.gz
Get bi allelic SNPs
vcftools --gzvcf gCobs3.1.linkageMapping.vcf.gz --remove-indels --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out Cobs3.1.linkageMapping.snps # compress and index bgzip Cobs3.1.linkageMapping.snps.recode.vcf tabix Cobs3.1.linkageMapping.snps.recode.vcf.gz
Get heterozygous markers in F1 queen (F1Queen) that are homozygous fixed for different alleles in the parents or heterozygous in grandmother (PQueen)
gatk SelectVariants -V Cobs3.1.linkageMapping.snps.recode.vcf.gz -O biall.snps.noMendelianErrParents.vcf.gz -select 'vc.getGenotype("F1Queen").isHet() && (vc.getGenotype("PQueen").isHomRef() && vc.getGenotype("PMale").isHomVar() || vc.getGenotype("PQueen").isHet() && vc.getGenotype("PMale").isHom()) || vc.getGenotype("F1Queen").isHet() && (vc.getGenotype("PQueen").isHomVar() && vc.getGenotype("PMale").isHomRef() || vc.getGenotype("PQueen").isHet() && vc.getGenotype("PMale").isHom())' -select 'vc.getGenotype("F1Male").isHomVar() && (vc.getGenotype("PQueen").isHomVar() || vc.getGenotype("PQueen").isHet()) || vc.getGenotype("F1Male").isHomRef() && (vc.getGenotype("PQueen").isHomRef() || vc.getGenotype("PQueen").isHet())'
In ants males are haploid. Any position called as heterozygous in males indicates a genotyping error and should be excluded. get only positions where none of the males is heterozygous
gatk SelectVariants -V biall.snps.noMendelianErrParents.vcf.gz -O biall.snps.noMendelianErrParents.noHetMale.vcf.gz --invertSelect true -select 'vc.getGenotype("EM1").isHet()' -select 'vc.getGenotype("EM112").isHet()' -select 'vc.getGenotype("EM2").isHet()' -select 'vc.getGenotype("EM3").isHet()' -select 'vc.getGenotype("EM35").isHet()' -select 'vc.getGenotype("EM36").isHet()' -select 'vc.getGenotype("EM37").isHet()' -select 'vc.getGenotype("EM38").isHet()' -select 'vc.getGenotype("EM39").isHet()' -select 'vc.getGenotype("EM4").isHet()' -select 'vc.getGenotype("EM5").isHet()' -select 'vc.getGenotype("EM78").isHet()' -select 'vc.getGenotype("F1Male").isHet()' -select 'vc.getGenotype("PMale").isHet()'
Check and exclude genotypes showing mendelian discrepancies using bcftools +mendelian function. This concerns the diploid workers. Males get one of the two F1 queen haplotypes and we have removed heterozygous positions anyways. For the workers however, there might be instances of mendelian errors. These are either genotyping errors or novel mutations, that should be excluded. For instance if the F1 queen is 0/1 and the F1 father is 1/1 the only two possible F2 genotypes are 0/1 or 1/1 (possible genotype), but never 0/0 (impossible genotype). If 0/0 then that should be switched to missing. In the next section we extract the genotypes of each of the workers and compare it to the F1 father. If impossible genotype, we replace it with missing (./.).
## 1) get only genotypes and switch | to / cat <(zgrep "#" biall.snps.noMendelianErrParents.noHetMale.vcf.gz) <(bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO\tGT[\t%GT]\n' biall.snps.noMendelianErrParents.noHetMale.vcf.gz)|sed 's@|@/@g' > biall.snps.noMendelianErrParents.noHetMale.GT.vcf bgzip biall.snps.noMendelianErrParents.noHetMale.GT.vcf tabix biall.snps.noMendelianErrParents.noHetMale.GT.vcf.gz ## 2) # get worker + F1 father samples sample=$(zgrep "^#CH" biall.snps.noMendelianErrParents.noHetMale.GT.vcf.gz|cut -f10-|tr "\t" "\n"|grep "^W") for i in $sample; do bcftools view -s F1Male,$i biall.snps.noMendelianErrParents.noHetMale.GT.vcf.gz -Ov -o F1Male.$i.biall.snps.noMendelianErrParents.noHetMale.GT.vcf;done ## 3) loop over the different vcf files, check of genotypes of a given worker match the expected one, knowing the genotype of the father. If not replace with ./. ### replace XXX with the value from grep "^#" F1Male.$i.biall.snps.noMendelianErrParents.noHetMale.GT.vcf for i in F1Male.W*.vcf; do base=$(basename -s .vcf $i); awk -vFS="\t" -vOFS="\t" 'NR<=XXX{print $0}; FNR > XXX {if ($10=="1/1" && $11 == "0/0") {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, "./."} else if ($10=="0/0" && $11 == "1/1") {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, "./."} else {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11} }' $i > "${base}.ed.vcf";done ## 4) get the edited worker genotypes for i in $sample; do bcftools view -s $i F1Male.$i.biall.snps.noMendelianErrParents.noHetMale.GT.ed.vcf -Ov -o $i.biall.snps.noMendelianErrParents.noHetMale.GT.ed.vcf; done for i in W*.vcf; do bgzip $i;done for i in W*.vcf.gz; do tabix $i;done ## 5) get males and parents bcftools view -s PMale,PQueen,F1Male,F1Queen,EM1,EM112,EM2,EM3,EM35,EM36,EM37,EM38,EM39,EM4,EM5,EM78 biall.snps.noMendelianErrParents.noHetMale.GT.vcf.gz -Ov -o grandParentsF1Parents.males.vcf bgzip grandParentsF1Parents.males.vcf tabix grandParentsF1Parents.males.vcf.gz ## 6) merge everything bcftools merge grandParentsF1Parents.males.vcf.gz $(ls W*.ed.vcf.gz) -Oz -o biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.vcf.gz tabix biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.vcf.gz ## 7) clean rm F1Male.W* rm W* rm grandParentsF1Parents.males.vcf.gz* ## add original annotations (i.e. metrics associated with each genotype) to the edited vcf file bcftools annotate -a ../biall.snps.noMendelianErrParents.noHetMale.vcf.gz -c INFO,^FORMAT/GT biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.vcf.gz -Oz -o biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.vcf.gz ## 8) compare mendelian error before/after to check if everything worked paste <(bcftools +mendelian biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.vcf.gz -T pedigree.txt ) <(bcftools +mendelian biall.snps.noMendelianErrParents.noHetMale.vcf.gz -T pedigree.txt) > afterANDbefore.txt
Use VCFtools to filter based on coverage, missingness and maf.
vcftools --gzvcf biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.vcf.gz --max-meanDP 50 --min-meanDP 19 --minDP 5 --max-missing 0.8 --maf 0.2 --recode --recode-INFO-all --out biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered
Males are haploid and the direct product of the F1 queen. The 88 workers included in our design are diploid and have one haplotype from the F1 father plus the one from their F1 mother. We are looking at recombination in the F1 queen and thus interested in the maternal contribution in these workers. In the following section we exclude the father's genotype/contribution from each of the workers to only keep the maternal recombinant haplotype.
# get genotypes + position etc into a text file cat <(bcftools view biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered.recode.vcf|grep "#C"|cut -f1-5,10-) <(bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%GT]\n' biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered.recode.vcf) > phasing/unphasedMarkers.txt # replace missing (./.) genotypes with -, homozygous alt by 1, heterozygous genotypes by 01 and homozygous ref by 0. This is what each of the sed is doing paste <(cut -f1-5 unphasedMarkers.txt) <(cut -f6- unphasedMarkers.txt|sed 's@\./\.@-@g'|sed -E 's@1/1|1\|1@1@g'|sed -E 's@0/1|0\|1@01@g'| sed -E 's@0/0|0\|0@0@g') > unphasedMarkers.forPhasing.txt # get F1male, F1Queen and workers' genotypes only cut -f8,9,22- unphasedMarkers.forPhasing.txt > F1parentsF2workers.tsv # delete parental contribution from heterozygous genotypes (01). Add these "phased" worker genotypes to the F2 males data in cut -f1-5,10-21 unphasedMarkers.forPhasing.txt and save everything into phasedMarkers.noMissF1male.tsv. exclude markers where we don't have genotype info for F1Male by awk '{if ($18 != "-") print $0}'. paste <(cut -f1-5,10-21 unphasedMarkers.forPhasing.txt) <(awk 'BEGIN{FS=OFS="\t"} {for (i=2; i<=NF; i++) if ($i == "01") gsub($1, "", $i)}1' F1parentsF2workers.tsv) | awk '{if ($18 != "-") print $0}' > phasedMarkers.noMissF1male.tsv # get annotations from unphased vcf file bcftools view biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered.recode.vcf |grep -v "##"|awk 'BEGIN{FS=OFS="\t"} {for (i=1; i<9; i++) printf $i FS; print "GT"}' > vcf.annotations.tsv # change first GT to FORMAT!
Now switch to R to generate a phased vcf file
# add annotations from vcf file before creating a vcf file phased <- read_delim("phasing/phasedMarkers.noMissF1male.tsv", col_names = T, delim = "\t") ann <- read_delim("phasing/vcf.annotations.tsv", col_names = T, delim = "\t") combined <- right_join(dplyr::select(ann,c(`#CHROM`, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT)), dplyr::select(phased,c(`#CHROM`,POS,ID,REF,ALT,F1Male,EM1,EM112,EM2,EM3,EM35,EM36,EM37,EM38, EM39,EM4,EM5,EM78,W101,W103,W106,W107,W109,W110, W111,W13,W14,W15,W16,W17,W18,W19,W200,W201,W202,W204,W205,W206, W207,W211,W212,W221,W23,W25,W26,W27,W29,W30,W300,W301,W302,W303, W304,W32,W33,W34,W40,W41,W42,W43,W44,W45,W46,W49,W50,W51,W52,W53, W55,W56,W57,W58,W59,W6,W60,W61,W62,W63,W64,W68,W69,W7,W70,W71,W72, W73,W74,W75,W76,W77,W78,W8,W80,W81,W82,W83,W84,W85,W86,W88,W89,W9, W90,W91,W92,W97,F1Queen))) write.table(combined, "phasing/vcf.body.phased.tsv", sep = "\t",col.names = T, quote = F, row.names = F, na = "-")
Return to bash and produce the final clean vcf file with phased gentotypes
# add headers, "diploidize" and switch missing (-) to ./. cat <(bcftools view biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered.recode.vcf |grep "##") <(paste <(cut -f1-9 vcf.body.phased.tsv ) <(cat <(cut -f10- vcf.body.phased.tsv |grep "^F") <(cut -f10- vcf.body.phased.tsv | grep -v "^F"|sed 's@1@1/1@g'|sed 's@0@0/0@g'|sed 's@-@./.@g'))) > biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered.recode.phased.vcf bgzip biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered.recode.phased.vcf tabix biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered.recode.phased.vcf.gz # add annotations (in FORMAT) bcftools annotate -a biall.snps.noMendelianErrParents.noHetMale.vcf.gz -c INFO,^FORMAT/GT phasing/biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.ANN.filtered.recode.phased.vcf.gz -Oz -o biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.filtered.phased.vcf.gz
# get genotypes of the 100 F2 individuals and recode them into a format required by MSTmap, i.e. homozygous alt (1/1 into A), homozygous ref (0/0 into B) and ./. into - bcftools view -s ^F1Male,F1Queen biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.filtered.phased.vcf.gz |bcftools query -f '%CHROM\t%POS[\t%GT]\n' |sed 's@1/1@A@g'|sed 's@0/0@B@g'|sed 's@./.@-@g'|sed 's/\./-/g' > markers.tsv # add marker ID and sample names cat <(bcftools view -s ^F1Male,F1Queen biall.snps.noMendelianErrParents.noHetMale.GT.noMenErr.filtered.phased.vcf.gz |grep "#C"|cut -f1,2,10-) markers.tsv |awk 'BEGIN{FS=OFS="\t"}{$1=$1"_"$2;$2=""; print $0}'|tr -s "\t" > markers.set.for.MSTmap.txt # then duplicate markers awk -vOFS="\t" '{$1 = $1"-1"; print}' markers.set.for.MSTmap.txt > markers.for.MSTmap.1.tmp.txt awk -vOFS="\t" '{$1 = $1"-2"; print}' markers.set.for.MSTmap.txt > markers.for.MSTmap.2.tmp.txt # flip phase in one file cat markers.for.MSTmap.2.tmp.txt|sed 's/A/0/g'|sed 's/B/A/g'|sed 's/0/B/g' > markers.for.MSTmap.2.prime.txt # merge the two sets of markers and exclude markers on unplaced scaffolds (had less than two markers anyways) cat markers.for.MSTmap.1.tmp.txt <(tail +2 markers.for.MSTmap.2.prime.txt)|grep -v "scaff" > markers.for.MSTmap.readyToMap.txt # run MSTmap ~/softwares/MSTmap/mstmap markers.for.MSTmap.readyToMap.txt markers.for.MSTmap.mapped.txt >> mstmap.stdout 2>&1
To split chimeric scaffolds (i.e. scaffolds where markers are assigned to more than one LG), we checked for the existence of gaps separating the parts belonging to different LGs. When gaps were absent, chimeric scaffolds were split after the last marker of the first part. For misoriented scaffolds, we visually inspected the correlation between the physical and genetic positions of markers.
# get gaps in ref: perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++; if($_ eq "N" && $s ==0 ){$z=$i; print "$head\t$z"; $s =1}elsif($s==1 && $_ ne "N"){$j=$i-1;print "\t$j\n";$s=0}}' ~/genome/Cobs.alpha.2.1.fa > Cobs2.1.gaps.bed # look at composition of each LG (from MSTmap) and create a bed file with the name of scaffolds, start and end positions, and the orientation of scaffolds forming the LG. # See attached tables with the paper for an example # get the refined sequence for each LG using the new coordinates bed file for i in LG*.bed; do base=$(echo $i|cut -f1 -d"."); bedtools getfasta -fi ~/genome/Cobs.alpha.2.1.fa -bed $i -s |fold -w 60 > $base.fa;done # add LG name to each sequence for i in LG*.fa; do l=$(echo $i|cut -f1 -d"."); echo $l $i; echo ">"$l > $l.final.fa; cat $i|grep -v ">" >> $l.final.fa;done # merge them into one fasta file cat LG*.final.fa|seqkit sort -l -r - > LGs.sorted.fa # get sequences not placed in any LG cat LG*bed |sort -V -k1,1 -k2,2 > regionsInLGs.bed bedtools complement -i regionsInLGs.bed -g Cobs2.1.chrom-size.txt > regionsNotInLGs.bed bedtools getfasta -fi ~/sciebo/My_PhD/Cobs_genomic_resources/genome/Cobs.alpha.2.1.fa -bed regionsNotInLGs.bed| seqkit sort -l -r - > regionsNotInLGs.sorted.fa # merge LG groups and unplaced sequences cat LGs.sorted.fa regionsNotInLGs.sorted.fa > Cobs3.0.tmp.fa # And to remove sequences with only NNNs seqtk comp Cobs3.0.tmp.fa|awk '{c=0;for(i=3;i<=6;++i){c+=$i}; gsub(">",""); if(c>0) print $1}'|xargs samtools faidx Cobs3.0.tmp.fa > Cobs3.0.clean.fa picard NormalizeFasta I=Cobs3.0.clean.fa O=Cobs3.1.clean.fa
Windowed recombination rates were estimated from the MSTmap output. We plotted genetic versus physical distance and visually checked correlations. Markers showing discrepencies between genetic and physical distances were removed. The final ordered markers file can be found in the datasets folder. I would like to thank Antonia Klein! Here code was a very great help here.
# load libraries library(tidyverse) library(ggpubr) # load data lg <- read_delim("datasets/manuallyCurated.markers.forR.txt", col_names = F, delim = "\t") colnames(lg) <- c("chr", "pos", "id", "distance1", "lg", "distance") head (lg) # reorder lg: plotthis <- as.data.frame(lg[,c(5,6,3)]) colnames(plotthis) <- c("group", "position", "locus") head(plotthis) ## by lg size df <- plotthis %>% group_by(group) %>% summarise(max(position)) df <- df[order(df$`max(position)`, decreasing = T),] df$group ## by no of markers in each lg df2 <- plotthis %>% group_by(group) %>% summarise(length(position)) df2 <- df2[order(df2$`length(position)`, decreasing = T),] df2$group paste(shQuote(as.list(df$group)), collapse=", ") lgs <- c('LG1', 'LG2', 'LG3', 'LG4', 'LG5', 'LG6', 'LG7', 'LG8', 'LG9', 'LG10', 'LG11', 'LG12', 'LG13', 'LG14', 'LG15', 'LG16', 'LG17', 'LG18', 'LG19', 'LG20', 'LG21', 'LG22', 'LG23', 'LG24', 'LG25', 'LG26', 'LG27', 'LG28', 'LG29') paste(shQuote(as.list(sub("LG", "", lgs))), collapse=", ") mcombsub <- reshape2::melt(lg, id=c("chr", "pos", "id", "lg", "distance1")) mcombsub$CHR <- as.numeric(sub("LG", "", lg$lg)) mcombsub$CHR <- factor(mcombsub$CHR, levels = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29')) # plot physical vs genetic distances: ########################################## ### Plot physical vs genetic distances ### ########################################## ggplot(filter(mcombsub, CHR %in% c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29')), aes(pos/1000000, value, color=variable))+ theme_bw()+ geom_jitter(aes(), size=.4, alpha=.2)+ geom_line(aes()) + stat_cor(method = "s",label.y.npc="top", label.x.npc = "left", cor.coef.name = "rho", cex=2, colour = "red")+ scale_fill_manual(values=c("#66c2a5"))+ scale_color_manual(values=c("#66c2a5"))+ theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(colour = "black") )+ scale_y_continuous(name=expression(paste("genetic distance (cM)", italic(" "))))+ scale_x_continuous(name=expression(paste("physical distance (Mb)", italic(" "))))+ theme(legend.position = "none")+ # facet_grid(cols=vars(CHR), scales = "free", space = "free_x")+ facet_wrap(~CHR, ncol=5, nrow = 6, scales = "free")+ theme(panel.spacing.x=unit(.5,"lines"), panel.spacing.y=unit(.5,"lines"), panel.background = element_blank(), axis.text.x = element_text(size=7, angle=45), #axis.ticks.x = element_blank(), panel.spacing = unit(.1, "cm"), #panel.background=element_rect(fill='white', colour='black'), strip.background=element_rect(fill='white', colour='black'))+ ggtitle("")+ guides(colour = guide_legend(nrow = 1)) # dev.print(pdf,"~/sciebo/My_PhD/Cobs3.1/linkageMap/analyses/RRfinalMarkerSet/Figure.S2.pdf", height=10, width=8) ################################################## ### Plot Marker distribution along chromosomes ### ################################################## # load data lg <- read_delim("datasets/manuallyCurated.markers.forR.txt", col_names = F, delim = "\t") colnames(lg) <- c("chr", "pos", "id", "distance1", "lg", "distance") lg$scf = as.numeric(gsub("LG", "", lg$chr)) head (lg) # load chromosome sizes scf = read.table("datasets/Cobs3.1.chrom-size.txt") scf = scf[seq(1,29,1),] scf = cbind(scf, seq(1,29,1)) names(scf) = c("scf", "length", "number") head(scf) # plot bp <- barplot(scf$length/1E6, border=NA, col="grey90", ylim=c(0,15), xlab="linkage group", ylab="length in Mb", names.arg=scf$number, cex.axis=.7, cex.names=.7, space = 10) with(lg, segments( bp[scf,]-0.5, pos/1E6, bp[scf,]+0.5, pos/1E6, col="black", lwd=.5, lend="butt", ) ) ########################################## ### Plot markers and linkage groups ### ########################################## # load library library("LinkageMapView") # output file outfile = file.path("sciebo/My_PhD/Cobs3.1/linkageMap/analyses/RRfinalMarkerSet/manuallyCurated.markers.forR.LGs.v2.pdf") # plot lmv.linkage.plot(plotthis,outfile, mapthese=c('LG7', 'LG2', 'LG3', 'LG5', 'LG1', 'LG8', 'LG21', 'LG9', 'LG15', 'LG20', 'LG23', 'LG24', 'LG14', 'LG10', 'LG25', 'LG16', 'LG19', 'LG11', 'LG4', 'LG27', 'LG13', 'LG22', 'LG18', 'LG29', 'LG12', 'LG17', 'LG6', 'LG28', 'LG26'), dupnbr=TRUE, lg.col = "#CCCCFF", lgperrow = 15, ruler = T, pdf.width = 56, pdf.height = 40) ###################################### ### Computing gwRecombination Rates### ###################################### # recombination between recombining markers rr1 <- lg %>% group_by(chr) %>% arrange(pos, .by_group = TRUE) %>% mutate(RR = distance - lag(distance, default = first(distance)), PhysicalDist = (pos - lag(pos, default = first(pos)))/1000000, pos2=lag(pos, default = first(pos))) # cluster non recombining markers into regions m1 <- lg %>% group_by(chr,distance) %>% arrange(pos, .by_group = TRUE) %>% filter(row_number() == 1 | row_number() == n()) # calculate recombination rate as: the ratio of the genetic to physical distance (in cM/Mb) rr <- m1 %>% group_by(chr) %>% arrange(pos, .by_group = TRUE) %>% mutate(clusterGeneticDistMb = (distance - lag(distance, default = first(distance))), clusterPhysicalDistcM = (pos - lag(pos, default = first(pos)))/1000000, recombRatecMMb = (distance - lag(distance, default = first(distance)))/((pos - lag(pos, default = first(pos)))/1000000)) # save this! write_delim(rr, "manuallyCurated.markers.RecomRateBetweenClustersGold.txt", delim = "\t") # you need to edit the output in text editor to get recombination rates in a bed file and subsequently use bedtools intersect! # bedtools makewindows -g chrom-size.txt -w 250000 -i srcwinnum |bedtools sort > 250kbwindows.bed # bedtools intersect -a recombinationRateinBed.bed -b 250kbwindows.bed -wa -wb > recombinationRatein250kb.windows.txt ################################################## ### CALCULATE RECOMEBINATION IN SLIDING WINOWS ### ################################################## data = read.table("recombinationRatein250kb.windows.txt", header=F) head(data) data = data[ ,c(1, 2, 3, 4, 6, 7, 8)] names(data) <- c("scf","scf_position1","scf_position2", "RR", "window_start", "window_end", "window_name") # calculate weighted mean of "RR" for each window: ################################################## datalist = list() for (j in unique(data$scf)) { print(j) data1 <- data[data$scf==j,] data1$window_name_short = as.numeric(gsub("^[^_]*_", "", data1$window_name)) dt = data1 dt$distance_bp <- data1$scf_position2-data1$scf_position1 df = data.frame(window_name_short=seq(1,max(dt$window_name_short),1), window_size=NA, mean_RR=NA, weighted_mean_RR=NA) for(i in unique(dt$window_name_short)) { window = paste("window", i, sep="") assign(window, subset(dt, window_name_short==i)) # if the window contains only one row, set window size to 1 and use its RR value for mean and weighted mean calculation if(nrow(get(paste("window",i,sep="")))==1) { df$window_size[i] <- "1" df$mean_RR[i] <- dt[dt$window_name_short==i, ]$RR df$weighted_mean_RR[i] <- dt[dt$window_name_short==i, ]$RR } else { df$window_size[i] <- nrow(get(paste("window",i,sep=""))) # calculate mean RR value for the current window and store in 'df' data frame df$mean_RR[i] <- mean(dt[dt$window_name_short==i, ]$RR) # calculate weighted mean RR value # save rr of this window in a new vector: rr_window = paste("rr_window", i, sep="") assign(rr_window, dt[dt$window_name_short==i, ]$RR) # save physical distances of this window in a new vector: dist_window = paste("dist_window", i, sep="") assign(dist_window, dt[dt$window_name_short==i, ]$distance_bp) # calculate rr_window * dist_window product = paste("product", i, sep="") assign(product, get(paste("rr_window", i, sep="")) * get(paste("dist_window", i, sep=""))) # add: sum = paste("sum", i, sep="") assign(sum, sum(get(paste("product", i, sep="")))) # sum_distance: sum_distance = paste("sum_distance", i, sep="") assign(sum_distance, sum(get(paste("dist_window", i, sep="")))) # weighted_mean: w_mean = paste("w_mean", i, sep="") assign(w_mean, get(paste("sum", i, sep="")) /get(paste("sum_distance", i, sep="")) ) df$weighted_mean_RR[i] <- get(paste("w_mean", i, sep="")) } } # merge data_seq: ################## merged = merge(data1, df, by="window_name_short") # add column with window mid position as midPos -values: merged$midPos = (as.numeric(merged$window_end) + as.numeric(merged$window_start))*1E-6/2 merged_unique = merged[!duplicated(merged[,"window_name"]),] datalist[[j]] <- merged_unique # remove objects with names containing 'window', 'product', 'distance', 'sum', and 'mean' rm(list = ls(pattern = "window|product|distance|sum|mean")) } windowedRR = do.call(rbind, datalist) # add windows without markers: nonoverlapping = read.table("250kbwindows.bed", header=F) names(nonoverlapping) = c("scf","window_start","window_end","window_name") df2 = merge(windowedRR, nonoverlapping, by="window_name", all.x=T, all.y=T) # store this! write_delim(df2, "finalRecombinationRateEstimates.250kbwindows.txt", delim = "\t")
For estimating nucleotide diversity (π), Tajima's D and genetic differentiation (FST) from pool-seq data we followed workflow described before accessible here. For nucleotide diversity at synonymous (πS) and nonsynonymous (πNS) sites, we used script, and available under PoPoolation as follow. These metrics were calculated in nonoverlapping 250-kb windows.
# for Itabuna pool popoolation_1.2.2/syn-nonsyn/Syn-nonsyn-sliding.pl --measure pi --pool-size 30 --gtf Cobs.alpha.v.3.1.geneannotation.1.5.polished.gtf --pileup A006200053_109203_S22_L002_CSFRD_indelsfree.pileup --output pool_Itabuna_syn-nonsyn_250kb250kb_nsl_p1.pi --window-size 250000 --step-size 250000 --fastq-type sanger --min-coverage 4 --max-coverage 75 --codon-table popoolation_1.2.2/syn-nonsyn/codon-table.txt --nonsyn-length-table popoolation_1.2.2/syn-nonsyn/nsl_p1.txt # for Una pool popoolation_1.2.2/syn-nonsyn/Syn-nonsyn-sliding.pl --measure pi --pool-size 50 --gtf Cobs.alpha.v.3.1.geneannotation.1.5.polished.gtf --pileup A006200053_109205_S23_L002_CSFRD_indelsfree.pileup --output pool_Una_syn-nonsyn_250kb250kb_nsl_p1.pi --window-size 250000 --step-size 250000 --fastq-type sanger --min-coverage 4 --max-coverage 77 --codon-table popoolation_1.2.2/syn-nonsyn/codon-table.txt --nonsyn-length-table popoolation_1.2.2/syn-nonsyn/nsl_p1.txt
Using GATK as described above and 16 signle individual genomes available at NCBI (BioProject: PRJNA680013), we identified variants. We then selected inDels and filtered them as follow:
# get indels: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /global/projects/programs/source/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar SelectVariants -V Cobs3.1.singleIndv.variants.vcf.gz --select-type-to-include INDEL -O Cobs3.1.singleIndv.indels.vcf.gz # apply GATK recommended hard filters: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /global/projects/programs/source/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar VariantFiltration -V Cobs3.1.singleIndv.indels.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "FS > 200.0" --filter-name "FS200" -filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" -O Cobs3.1.singleIndv.indels_F.vcf.gz # get indels passing filters: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /global/projects/programs/source/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar SelectVariants -V Cobs3.1.singleIndv.indels_F.vcf.gz -select 'vc.isNotFiltered()' -O Cobs3.1.singleIndv.indelsPass.vcf.gz # further filtering using VCFtools: vcftools --gzvcf Cobs3.1.singleIndv.indelsPass.vcf.gz --recode-INFO-all --recode --maf 0.05 --max-missing 0.8 --out Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8 # calculate pi: vcftools --vcf Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.vcf --window-pi 250000 --out Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.freq # genic vs intergenic: ### get genic indels: bcftools view -R Cobs3.1.genes.sorted.bed Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.vcf.gz -Ov -o Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.genic.vcf # then calculate pi: vcftools --vcf Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.genic.vcf --window-pi 250000 --out Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.genic.freq ### get inetrgenic indels: # get intergenic intervals: bedtools complement -i Cobs3.1.genes.sorted.bed -g Cobs3.1.clean.submitted.chrom.size.sorted.tsv > Cobs3.1.intergenes.bed # extract intergenic indels: bcftools view -R Cobs3.1.intergenes.bed Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.vcf.gz -Ov -o Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.intergenic.vcf # calculate pi: vcftools --vcf Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.intergenic.vcf --window-pi 250000 --out Cobs3.1.singleIndv.indelsPass.biall.MolEcolSamples.maf0.05.miss0.8.recode.intergenic.freq
##################################### # load genomes,CDSs and peptide sequences # Solenopsis invicta wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/solenopsis_invicta/dna/Solenopsis_invicta.UNIL_Sinv_3.0.dna.toplevel.fa.gz wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/solenopsis_invicta/cds/Solenopsis_invicta.UNIL_Sinv_3.0.cds.all.fa.gz wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/solenopsis_invicta/pep/Solenopsis_invicta.UNIL_Sinv_3.0.pep.all.fa.gz # monomorium pharaonis wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/monomorium_pharaonis_gca013373865v2/dna/Monomorium_pharaonis_gca013373865v2.GCA013373865v2.dna.toplevel.fa.gz wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/monomorium_pharaonis_gca013373865v2/cds/Monomorium_pharaonis_gca013373865v2.GCA013373865v2.cds.all.fa.gz wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/monomorium_pharaonis_gca013373865v2/pep/Monomorium_pharaonis_gca013373865v2.GCA013373865v2.pep.all.fa.gz # ooceraea biroi wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/ooceraea_biroi_gca003672135v1/dna/Ooceraea_biroi_gca003672135v1.Obir_v5.4.dna.toplevel.fa.gz wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/ooceraea_biroi_gca003672135v1/cds/Ooceraea_biroi_gca003672135v1.Obir_v5.4.cds.all.fa.gz wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/ooceraea_biroi_gca003672135v1/pep/Ooceraea_biroi_gca003672135v1.Obir_v5.4.pep.all.fa.gz # Nasonia vitripennis wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/nasonia_vitripennis/dna/Nasonia_vitripennis.Nvit_psr_1.1.dna.toplevel.fa.gz wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/nasonia_vitripennis/pep/Nasonia_vitripennis.Nvit_psr_1.1.pep.all.fa.gz wget http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/metazoa/fasta/nasonia_vitripennis/cds/Nasonia_vitripennis.Nvit_psr_1.1.cds.all.fa.gz # get the same for C. obscurior ln -s ~/data/genomes/Cardiocondyla_obscurior/Cobs3.1/Cobs3.1.clean.fa . ln -s ~/data/genomes/Cardiocondyla_obscurior/Cobs3.1/annotations/geneAnnotation/Cobs.alpha.v.3.1.geneannotation.1.5.polished.protein.fa . ##################################### # use agat (https://github.com/NBISweden/AGAT) to get CDSs agat_sp_extract_sequences.pl -g ~/data/genomes/Cardiocondyla_obscurior/Cobs3.1/annotations/geneAnnotation/Cobs.alpha.v.3.1.geneannotation.1.5.polished.gff3 -f ../genomes/Cobs3.1.clean.fa -t cds --output Cobs.alpha.v.3.1.geneannotation.1.5.polished.cds.fa # Simplify headers by keeping only transcriptID:geneID, for instance >XM_011164587.3_gene:LOC105197943 or >XM_028189372.2_gene:LOC105840702 or >COBS16369-mRNA-1_gene:COBS16369 for i in Mpha Nvit Obir Sinv; do echo $i; awk -F' ' '{print (/^>/ ? $5"_"$4 : $0)}' $i.pep.fa|sed 's/transcript:/>/g' > tmp.txt && mv tmp.txt $i.pep.fa;done cat pep/Cobs.pep.fa |awk -F' ' '{print (/^>/ ? $1"_"$2 : $0)}'|sed 's/=/:/g' > tmp.txt && mv tmp.txt pep/Cobs.pep.fa # get longest isoform peptide sequence; speciesList.txt has species names (also used as file names), i.e. Mpha Nvit Obir Sinv Cobs https://bioinf.shenwei.me/seqkit/ for i in $(cat speciesList.txt); do seqkit fx2tab -l pep/$i.pep.fa|sed 's/_gene:/\t/g' |sort -k2,2 -k4,4nr|sort -k2,2 -u -s|awk '{print ">"$1";gene="$2";length="$4"\n"$3}' > longestIso/longest.$i.pep.fa;done # get longest isoform/transcript ID for i in $(cat speciesList.txt );do grep "^>" longestIso/longest.$i.pep.fa|cut -f1 -d";"|sed 's/>//g' > cds/$i.longest_cds.txt ; done for i in *.fa; do awk '/^>/ {$0=$1} 1' $i > tmp.fa && mv tmp.fa $i;done # get longest isoform CDSs sequence for i in $(cat ../speciesList.txt );do seqkit grep -n -f $i.longest_cds.txt $i.cds.fa > longest.$i.cds.fa ; done #removed -r to match full id # unify headers for i in $(cat speciesList.txt); do cat cds/longest.$i.cds.fa|sed "/^>/ s/$/_$i/" > cds/longest.simpleH.$i.cds.fa;done for i in $(cat speciesList.txt); do cat pep/longest.$i.pep.fa|sed "s/;.*$/_$i/g" > pep/longest.simpleH.$i.pep.fa;done ##################################### # prepare for OrthoFinder mkdir orthofinder && mkdir inputs ln -s ~/data/Cobs3.1/linkageMap/dNdS/inputs/pep/longest.simpleH.* . # run OrthoFinder nice orthofinder -f inputs/ -t 25 > orthofinder.log ##################################### # prepare for sequence alignments mkdir alignments # get ortholgs prot cat orthofinder/inputs/OrthoFinder/Results_Oct08_1/Orthogroups/Orthogroups_SingleCopyOrthologues.txt |while read group ; do cp orthofinder/inputs/OrthoFinder/Results_Oct08_1/Orthogroup_Sequences/"$group".fa alignments/"$group".faa ; done # get corresponding cds cat orthofinder/inputs/OrthoFinder/Results_Oct08_1/Orthogroups/Orthogroups_SingleCopyOrthologues.txt |while read group ; do cat speciesList.txt | while read species ; do grep "$species" ./alignments/"$group".faa | sed 's/>//g' | seqkit grep -n -f - cds/longest.simpleH."$species".cds.fa >> ./alignments/"$group".fna ; done ; done # produce alignment cat ~/data/Cobs3.1/linkageMap/dNdS/inputs/orthofinder/inputs/OrthoFinder/Results_Oct08_1/Orthogroups/Orthogroups_SingleCopyOrthologues.txt |nice parallel --jobs 25 --eta 'clustalo -i {1}.faa -o {1}.aln.faa' # converts a alignments and corresponding cds sequences into a codon alignment cat ~/data/Cobs3.1/linkageMap/dNdS/inputs/orthofinder/inputs/OrthoFinder/Results_Oct08_1/Orthogroups/Orthogroups_SingleCopyOrthologues.txt |time parallel --eta pal2nal.pl {1}.aln.faa {1}.fna -output fasta -nogap ">" {1}.pal2nal.fasta "2>" {1}.pal2nal.fasta.log # remove failed and erroneous alignments grep "ERROR" *.log | sed 's/.pal2nal.*//g' > failed.txt cat ~/data/Cobs3.1/linkageMap/dNdS/inputs/orthofinder/inputs/OrthoFinder/Results_Oct08_1/Orthogroups/Orthogroups_SingleCopyOrthologues.txt | grep -v -f failed.txt > passedAlignments.txt # filter codon alignments ls *.fasta| nice parallel --jobs 7 --eta '/global/projects/programs/source/Gblocks_0.91b-new/Gblocks {} -t=c -p=y' >> gblocks.stdout 2>&1 # run RAxML for each gene ls *.fasta-gb|nice parallel --eta --jobs 30 'raxmlHPC -m GTRGAMMA -p 12345 -# 20 -s {} -n {.}.tree1 -T 1' ls *.fasta-gb|nice parallel --eta --jobs 30 'raxmlHPC -m GTRGAMMA -p 12345 -b 12345 -# 100 -s {} -n {.}.tree2 -T 1' ls *.fasta-gb|nice parallel --eta --jobs 30 'raxmlHPC -m GTRCAT -p 12345 -f b -t RAxML_bestTree.{.}.tree1 -z RAxML_bootstrap.{.}.tree2 -n {.}.tree3 -T 1' # Run aBSREL cat passedAlignments.txt | while read group; do hyphy absrel --alignment sequence/"$group".pal2nal.fasta --tree bestTrees/RAxML_bipartitionsBranchLabels."$group".pal2nal.tree3 --output absrel/absrel-hyphy-"$group".json >> absrel/"$group".absrel.log 2>&1;done # Run the FITMG94 model: cat /home/m/merrbii/data/Cobs3.1/linkageMap/dNdS/inputs/hyphy/passedAlignments.txt |while read group; do nice ./HYPHYMP hyphy-analyses/FitMG94/FitMG94.bf --alignment ~/data/Cobs3.1/linkageMap/dNdS/inputs/hyphy/sequence/"$group".pal2nal.fasta --tree ~/data/Cobs3.1/linkageMap/dNdS/inputs/hyphy/bestTrees/RAxML_bipartitionsBranchLabels."$group".pal2nal.tree3 --output ~/data/Cobs3.1/linkageMap/dNdS/inputs/hyphy/FitMG94/fitMG94-hyphy-"$group".json --type local >> ~/data/Cobs3.1/linkageMap/dNdS/inputs/hyphy/FitMG94/"$group".fitMG94.log 2>&1;done
Here we used publickly available RNA-seq data generated from worker, queen, wingless male, and winged male third instar larvae (n = 7 individuals each; BioProject: PRJNA237579).
# download data: sort -u SRR_list.txt | parallel -j 28 "~/programs/sratoolkit.2.10.0-ubuntu64/bin/fastq-dump --split-3 {}" # remove rRNA: for i in $(cat filename.txt); do base=$(basename -s .fastq $i); echo "processing "$i ;nice sortmerna -ref $rRNA_databases/silva-euk-28s-id98.fasta -ref $rRNA_databases/silva-euk-18s-id95.fasta -reads ../${base}.fastq -workdir cobs3.1_rnaseq_linkageMapping/raw/sortmerna/ -fastx -other $base -threads 28 ; mv out/aligned_fwd.fq out/${base}.aligned_fwd.fq; mv out/aligned.log out/${base}.aligned.log;rm -rf kvdb/ ;rm -rf readb/;done # remove short/adapter sequences ls *fastq|sed 's/.fastq//g'|nice parallel --jobs 15 --eta 'java -jar /global/projects/programs/bin/trimmomatic-0.38.jar SE {}.fastq {}.trimmed.fq ILLUMINACLIP:/global/projects/programs/source/Trimmomatic-0.38/adapters/TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:30' # run STAR ls *.fq|sed 's/.trimmed.fq//g'|nice parallel --jobs 2 --eta 'STAR --runThreadN 12 --genomeDir ~/data/genomes/Cardiocondyla_obscurior/Cobs3.1/ncbi/annotations/STAR.100bp --readFilesIn {}.trimmed.fq --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix alignments/{}' # get only those with at least 80 cov ~/programs/subread-2.0.3-Linux-x86_64/bin/featureCounts -a Cobs3.1.clean.submitted_polished.agated.clean.gtf -o allLibraries.alignedSTAR.countReadPairs.s0 $(ls *.bam) -s 0 -T 25 -t gene >> featureCountsForStar.stdout.s0.txt 2>&1
Perform DGE analysis in R using limma following the developers steps here: https://www.bioconductor.org/packages/devel/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#data-pre-processing
Then calculate pi or the plasticity index as described in (Schrader et al., 2017).
We used the Tandem Repeats Finder tool with recommended parameters (2 5 7 80 10 50 2000):
# run TRF: ~/programs/TRF/trf409.linux64 Cobs3.1.clean.fa 2 5 7 80 10 50 2000 -d -m # parse output: perl ~/programs/k-seek/TRF_parse.pl Cobs3.1.clean.fa.2.5.7.80.10.50.2000.dat Cobs3.1.clean.fa.2.5.7.80.10.50.2000.TRF_parse # build TR library: grep -v "^chr" Cobs3.1.clean.fa.2.5.7.80.10.50.2000.TRF_parse.txt|awk '{if ($4 >=50) print $0}' > Cobs3.1.clean.fa.2.5.7.80.10.50.2000.TRF_parse.txt.50bp.txt cut -f1-3,6 Cobs3.1.clean.fa.2.5.7.80.10.50.2000.TRF_parse.txt.50bp.txt|awk '{print ">"$1"_"$2"_"$3"\n"$4}' > Cobs3.1.clean.fa.2.5.7.80.10.50.2000.TRF_parse.txt.50bp.fa # mask genome with repeat RepeatMasker /global/projects/programs/source/RepeatMasker/RepeatMasker -s -norna -gff -u -pa 20 -a -nolow -dir RM_nolow_norna/ -lib Cobs3.1.clean.fa.2.5.7.80.10.50.2000.TRF_parse.txt.50bp.fa Cobs3.1.clean.fa # quantify proportion of masked bases per 250 kb window: bedtools nuc -fi Cobs3.1.clean.fa.masked -bed 250kbwindows.bed > nuc.Cobs3.1.clean.submitted.fa.masked.nolow_norna_50bp