--- title: "Introduction of 'geneHapR'" author: "Zhang RenLiang" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction of 'geneHapR'} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: sentence `geneHapR` is designed for gene haplotype statistics, phenotype association and visualization. ### 1. DATA #### 1.1 Input data Dataset required for haplotype statistic, visualization and phenotype association and the import function were listed in Table 1. The genotype dataset is essential for haplotype identification and could be supplied in VCF, FASTA, P.link, HAPMAP and table format. The annotation were used for variants filtration and prepare schematic diagram. Detailed information of individuals include phenotype data, group/category information and geo-coordinates. The phenotype data was used for comparison between different haplotypes. The group /category information was used for pie plot with haplotype network (eg. the second column in Table 4). And the geo-coordinates only used for demonstration of geographical distribution and include two columns: longitude and latitude (eg. the third and fourth column in Table 4). **Table 1: The required format of dataset and import functions for geneHapR** | Dataset | File format | Import function | |------------------|---------------------------|---------------------------| | **Genotype**
(necessary) | VCF: \*.vcf, \*.vcf.gz;
Sequences: \*.fa, *.fasta;*
p.link: (\*.ped & \*.map);
hmp: \*.hmp;
table (eg. Table2): .txt, \*.csv | *import_vcf();*
import_seqs();
import_plink.pedmap();
import_hmp();
read.table(), read.csv() | | **Annotation**
(optional) | GFF: *.gff,* .gff3,
BED4/BED6 (eg. Table3): \*.bed | *import_gff()*
import_bed() | | **Accession information**
(optional) | table (eg. Table4): *.txt,* .csv | *import_AccINFO()* | Table 2 is an example of genotypic data in table format: The first five column are fixed as chromosome name (CHROM), position (POS), reference nucleotide (REF), alter nucleotide (ALT) and additional information (INFO). Accession genotype should be in followed columns. "-" will be treated as Indel. "." and "N" will be treated as missing data. Field in additional information column should be in format "tag=value", and separated by semicolon ";". Heterozygote should be looks like "A/G" or "A\|G". **Table 2: Table format of the genotypic dataset** | CHR | POS | REF | Alt | INFO | C001 | C002 | C003 | ... | |:----:|:-------:|:---:|:---:|:----------------:|:----:|:----:|:----:|-----| | Chr7 | 9154754 | T | C | CDS=G\>A;AA=V\>G | T | T | T | ... | | Chr7 | 9154664 | G | T | CDS=A\>C | G | G | G | ... | | Chr7 | 9154489 | C | G | CDS=C\>G | C | C | C | ... | | Chr7 | 9154469 | G | A | CDS=T\>C | G | G | G | ... | | ︙ | ︙ | ︙ | ︙ | ︙ | ︙ | ︙ | ︙ | | Table 3 is an example of annotation file in BED6 format. As described at [UCSC](http://genome.ucsc.edu/FAQ/FAQformat.html#format1), the BED6 file contains 6 columns: 1) chromosome name, 2) chromosome start, 3) chromosome end, 4) name, 5) score and 6) strand. The BED4 contains the first 4 column of BED6. ***BE NOTE THAT***: the fourth column was used to define the name and types, which were separated by a space. For example, the first line of Table 3 indicates that: the genomic interval from 9154280 (exclude) to 9154821 (include) on Chr7 chromosome is CDS of "LOC_Os07g15770.1" and the strand is "negative". **Table 3: An annotation example in BED6 format** | \# CHROM | START | END | GENEID TYPE | . | STRAND | |:------:|:------:|:------:|:-----------------------------:|:------:|:------:| | Chr7 | 9154380 | 9154821 | LOC_Os07g15770.1**·**CDS | . | \- | | Chr7 | 9152403 | 9152730 | LOC_Os07g15770.1**·**CDS | . | \- | Note: the red dot in fourth column indicate a space. Table 4 is an example of detailed information of individuals, includes group/category, geo-coordinates and phenotype data. First column are names of accessions/individuals, phenotypic information are listed in followed columns. **Table 4: An example of accession detailed information dataset** | id | Subpopulation | Longitude | Latitude | Grain
length | Grain
width | Grain
thickness | |:----------|-----------|:----------|:----------|:----------|:----------|:----------| | C001 | Indica | 121 | 14.6 | 8.5 | 2.9 | 1.96 | | C002 | Intermediate | 121 | 14.6 | 10.2 | 2.63 | 1.96 | | C003 | Japonica | 51.3 | 35.45 | 8.75 | 3.32 | 2.12 | | C004 | Japonica | 116.28 | 39.54 | 7.83 | 3.22 | 2.08 | | C005 | Japonica | 121 | 14.6 | 10.47 | 3 | 1.95 | | C006 | Indica | 116.28 | 39.54 | 8.1 | 2.47 | 1.69 | #### 1.2 Output data The main results are `hapResult` and `hapSummary` class in R, consist of a matrix which could be divided into three parts as shown in Fig.1, and some additional attributes. ```{r Out Put Format1, echo=FALSE, fig.align='center', fig.cap="Cartoon representation of hapResult and hapSummary contents", fig.height=4, fig.width=6} plot(c(0,5),c(0,5), axes = FALSE, type = "n", xlab="", ylab ="", frame.plot = F) rect(xleft=0, ybottom=0, xright=0.5, ytop=4.5) rect(xleft=0.5, ybottom=3.5, xright=4, ytop=4.5) rect(xleft=0.5, ybottom=0, xright=4, ytop=3.5) rect(xleft=4, ybottom=0, xright=5, ytop=3.5) text(0.25, 4.75, "Part I", cex=1) text(2.25, 4.75, "Part II", cex=1) text(4.5, 4.75, "Part III", cex=1) text(0.25, 3, "Lead column", cex=1, srt = 270) text(2.25, 4.15, "Sites information", cex=1) text(2.25, 3.85, "(CHROM, POS, INFO, ALLELE)", cex=0.8) text(2.25, 2, "Genotypes", cex=1) text(4.5, 2, "Accessions (freq)", cex=1, srt= 270) ``` Part I consists of only one column. And the first four lines were fixed as CHROM (chromosome name), POS (position), INFO (additional information) and ALLELE (allele). And followed lines are names of each haplotype. Part II consists of at least one column, contains site information (first four lines) and genotypes (followed lines). The part III of `hapResult` consists of one column named as Accession, while `hapSummary` consists of two columns named as Accession and freq (frequency of each haplotype). The differences between `hapResult` and `hapSummary` is that each line of `hapResult` indicate an accession/individual, and each line in `hapSummary` indicate a haplotype. ```{r include=FALSE} NOT_CRAN <- identical(tolower(Sys.getenv("NOT_CRAN")), "true") knitr::opts_chunk$set(purl = NOT_CRAN) ``` ## Installation `geneHapR` is schemed to submit to CRAN. If accepted, this package could be installed with `install.packages("geneHapR")`. `geneHapR` has not published yet, if you use `geneHapR` in your study, please contact [Zhang RenLiang (Maintainer) (email: zhang_renliang\@163.com)](mailto:%20zhang_renliang@163.com) or [Jia GuanQing (jiaguanqing\@caas.cn)](mailto:%20jiaguanqing@caas.cn) ```{r eval=FALSE} install.packages("geneHapR") ``` ## Data input The first step is library the `geneHapR` packages. I will use the test data inside this package as an example for how to perform statistics of a gene/range, visualization and phenotype association analysis. ```{r startup, eval=NOT_CRAN, echo=TRUE, message=FALSE, warning=FALSE} library(geneHapR) ``` ```{r setups, eval=NOT_CRAN, include=FALSE} data("geneHapR_test") ``` There are two options to conduct a gene haplotype analysis starts from a VCF file or DNA sequences file. Thus a **VCF** file or **DNA sequences** file is necessary. However, the *GFF*, *phenos* and *accession groups* are strongly recommend for visualization and phenotype associations. The import functions takes file path as input. `import_vcf()` could import VCF file with surfix of ".vcf" and ".vcf.gz". `import_gff()` import file format default as "GFF" and `import_seqs()` file format default as "fasta". ### Import VCF ```{r import vcf, eval=FALSE, include=TRUE} # import vcf file vcf <- import_vcf("your_vcf_file_path.vcf") # import gziped vcf file vcf <- import_vcf("your_vcf_file_path.vcf.gz") ``` ### Import p.link (ped/map) ```{r import p.link, eval=FALSE, include=TRUE} plink <- import_plink.pedmap(mapfile = "p_link.map", pedfile = "p_link.ped", sep_ped = "\t", sep_map = "\t") plink <- import_plink.pedmap(root = "p_link", sep_ped = "\t", sep_map = "\t") ``` ### Import GFF ```{r import gff, eval=FALSE, include=TRUE} # import GFFs gff <- import_gff("your_gff_file_path.gff", format = "GFF") ``` ### Import BED ```{r import bed, eval=FALSE, include=TRUE} # import GFFs bed <- import_bed("your_gff_file_path.bed") ``` ### Import DNA sequences ```{r import DNA seqs, eval=FALSE, include=TRUE} # import DNA sequences in fasta format seqs <- import_seqs("your_DNA_seq_file_path.fa", format = "fasta") ``` ## Import phenotype and accession group information ```{r import phenotype and accession groups, eval=FALSE, include=TRUE} # import phynotype data pheno <- import_AccINFO("your_pheno_file_path.txt") pheno ``` ```{r eval=NOT_CRAN, echo=FALSE} head(pheno) ``` ```{r import pheno, eval=FALSE} # import accession group/location information AccINFO <- import_AccINFO("accession_group_file_path.txt") ``` ```{r eval=NOT_CRAN, echo=FALSE} head(AccINFO) ``` Be aware that the phenotype and accession group are effectively tables. There are more than one ways to import a table format file with `R`. Be **Note** that: a. the accession/individual names located in first column; b. the first row contents phenotype/accession_group names; c. `NA` is allowed, it's not a wise option to replace `NA` by `0`. eg. ```{r import pheno and accessions_groups, eval=FALSE, include=TRUE} # import pheno from space ' ' delimed table pheno <- read.table("your_pheno_file_path.csv", header = TRUE, row.names = 1, comment.char = "#") # import pheno from ',' delimed table pheno <- read.csv("your_pheno_file_path.csv", header = TRUE, comment.char = "#") ``` ## Data manipulations There is a little work need to be done before haplotype calculations: (1) VCF filtration and (2) DNA sequences alignment. ### VCF filtration There are three modes to filter a `vcfR` object after import VCF into 'R': a. by position; b. by annotation; c. by both of them. ```{r VCF filtration, eval=FALSE, include=TRUE} # filter VCF by position vcf_f1 <- filter_vcf(vcf, mode = "POS", Chr = "scaffold_1", start = 4300, end = 5890) # filter VCF by annotation vcf_f2 <- filter_vcf(vcf, mode = "type", gff = gff, type = "CDS") # filter VCF by position and annotation vcf_f3 <- filter_vcf(vcf, mode = "both", Chr = "scaffold_1", start = 4300, end = 5890, gff = gff, type = "CDS") ``` It's a time consuming work to import and manipulate a very large file with 'R' on personal computer. It'll be more efficiency to extract the target ranges from origin VCF with `filterLargeVCF()` before import. If your VCF file is just a few 'MB', this step was not necessary at all. **Note:** if extract more than one ranges, length of output file names (`VCFout`) must be equal with `Chr` and `POS`. ```{r preprocess large VCF, eval=FALSE, include=TRUE} # new VCF file will be saved to disk # extract a single gene/range from a large vcf filterLargeVCF(VCFin = "Ori.vcf.gz", VCFout = "filtered.vcf.gz", Chr = "scaffold_8", POS = c(19802,24501), override = TRUE) # extract multi genes/ranges from large vcf filterLargeVCF(VCFin = "Ori.vcf.gz", # surfix should be .vcf.gz or .vcf VCFout = c("filtered1.vcf.gz", # surfix should be .vcf.gz or .vcf "filtered2.vcf.gz", "filtered3.vcf.gz"), Chr = c("scaffold_8", "scaffold_8", "scaffold_7"), POS = list(c(19802,24501), c(27341,28949), c(38469,40344)), override = TRUE) # if TRUE, existed file will be override without warning ``` ### p.link (ped/map) filtration ```{r p.link filtration, eval=FALSE, message=FALSE, warning=FALSE, include=TRUE} p.link <- filter_plink.pedmap(p.link, mode = "POS", Chr = "Chr08", start = 25947258, end = 25948258) ``` ### DNA sequences manipulation The origin DNA sequences must be aligned and trimmed due to haplotype calculation need all sequences have same length. Those operations could be done with `geneHapR`. I still suggest users align and trim DNA sequences with **Mega** software and then save the result as *FASTA* format before import them into 'R'. ```{r DNA alignment and trim, eval=FALSE, message=FALSE, warning=FALSE, include=TRUE} # sequences alignment seqs <- allignSeqs(seqs, quiet = TRUE) # sequences trim seqs <- trimSeqs(seqs, minFlankFraction = 0.1) seqs ``` ### hapResult/hapSummary filtration ```{r hap filtration, eval=FALSE, message=FALSE, warning=FALSE, include=TRUE} hap <- filter_hap(hapSummary, rm.mode = c("position", "accession", "haplotype", "freq"), position.rm = c(4879, 4950), accession.rm = c("C1", "C9"), haplotype.rm = c("H009", "H008"), freq.min = 5) ``` ## Haplotype calculation As mentioned before, haplotype could be calculated from VCF or sequences with `vcf2hap()` or `seqs2hap()`. The genotype of most sites should be known and homozygous, still, a few site are unknown or heterozygous due to chromosome variant or error cased by sequencing or SNP calling or gaps or other reasons. It's a hard decision whether to drop accessions/individuals contains heterozygous or unknown sites for every haplotype analysis. Hence, I leave the choice to users. Calculate haplotype result from VCF. ```{r haplotype calculation from vcf, eval=NOT_CRAN, include=TRUE, message=FALSE, warning=FALSE} hapResult <- vcf2hap(vcf, hapPrefix = "H", hetero_remove = TRUE, na_drop = TRUE) hapResult ``` Calculate haplotype result from aligned DNA sequences. ```{r haplotype calculation from seqs, eval=FALSE} hapResult <- seqs2hap(seqs, Ref = names(seqs)[1], hapPrefix = "H", hetero_remove = TRUE, na_drop = TRUE, maxGapsPerSeq = 0.25) ``` ## Adjustment of `hapResult` Before visualization, there were a few details need to be adjusted. eg. add annotations and adjust position of "ATG" ### Add annotations to `hapResult` While `hapResult` was calculated from `vcfR` object, the **INFO** was taken from `@fix` field. The *VCF INFO* may missing some annotations. or contents format was inappropriate to display. Further more, **INFO** contents nothing if `hapResult` was generated from sequences. Here, we can introduce/replace the origin **INFO** by `addINFO()`. **Note** that: length of `values` must be equal with number of sites. Let's see how mant sites contains in the `hapResult`. ```{r check site numbers, eval=NOT_CRAN} # Chech number of sites conclude in hapResult sites(hapResult) ``` Now we replace the old INFO field with new tag named as "PrChange". ```{r replace INFO, eval=NOT_CRAN, include=TRUE, message=FALSE, warning=FALSE} # add annotations to INFO field hapResult <- addINFO(hapResult, tag = "PrChange", values = rep(c("C->D", "V->R", "G->N"),3), replace = TRUE) ``` Here, we add a tag named as "CDSChange" followed the old INFO. ```{r add INFO, eval=NOT_CRAN, include=TRUE, message=FALSE, warning=FALSE} # To replace the origin INFO by set 'replace' as TRUE hapResult <- addINFO(hapResult, tag = "CDSChange", values = rep(c("C->A", "T->C", "G->T"),3), replace = FALSE) ``` ### Adjust position of "ATG" This function was only used to adjust the position of "ATG" to 0 and hence convert the gene on negative strand to positive strand. Be note that: **GFF** and **hapResult** need to adjust position of *ATG* with the same parameters. ```{r ATG position, eval=FALSE} # set ATG position as zero in gff newgff <- gffSetATGas0(gff = gff, hap = hapResult, geneID = "test1G0387", Chr = "scaffold_1", POS = c(4300, 7910)) # set position of ATG as zero in hapResult/hapSummary newhap <- hapSetATGas0(gff = gff, hap = hapResult, geneID = "test1G0387", Chr = "scaffold_1", POS = c(4300, 7910)) ``` ## `hapResult` summary and visualization Once we have the `hapResult` object, can we summarize and visualize our `hapResult` by interact with annotations and phenotypes. ### Summary hapResult Now, we have the `hapResult` object with INFOs we want display in next step. The `hap_summary()` function convert the object of `hapResult` class, which is a long table format, into a short table belong to `hapSummary` class. In `hapResult` each row represent a accession, while each row represents a hap in `hapSummary`. ```{r hapSummary, eval=NOT_CRAN, fig.height=4, fig.width=6, message=FALSE, warning=FALSE, paged.print=FALSE} hapSummary <- hap_summary(hapResult) hapSummary ``` ### Visualize haplotye as table Let's see how to visualization of our haplotype results. At first let's display the `hapSummary` as a table. In this table like figure we can see all the variants and their positions, haplotypes and their frequencies. ```{r, eval=NOT_CRAN, fig.height=4, fig.width=6, message=FALSE, warning=FALSE, paged.print=FALSE} plotHapTable(hapSummary) ``` Also we can add an annotation, "CDSChange", to the table by assign the `INFO_tag`. It's your responsibility to verify whether the INFO_tag was existed in the INFO field. ```{r, eval=NOT_CRAN, fig.height=4, fig.width=6, message=FALSE, warning=FALSE, paged.print=FALSE} # add one annotation plotHapTable(hapSummary, hapPrefix = "H", INFO_tag = "CDSChange", tag_name = "CDS", displayIndelSize = 1, angle = 45, replaceMultiAllele = TRUE, ALLELE.color = "grey90") ``` Now let's add another `INFO_tag` named as "PrChange". ```{r , eval=NOT_CRAN, fig.height=4, fig.width=6, message=FALSE, warning=FALSE, paged.print=FALSE} # add multi annotation plotHapTable(hapSummary, hapPrefix = "H", INFO_tag = c("CDSChange", "PrChange"), displayIndelSize = 1, angle = 45, replaceMultiAllele = TRUE, ALLELE.color = "grey90") ``` Parameter `tag_name` was used to replace the character if `INFO_tag` was too long. ```{r, eval=NOT_CRAN, fig.height=4, fig.width=6, message=FALSE, warning=FALSE, paged.print=FALSE} # add multi annotation plotHapTable(hapSummary, hapPrefix = "H", INFO_tag = c("CDSChange", "PrChange"), tag_name = c("CDS", "Pr"), displayIndelSize = 1, angle = 45, replaceMultiAllele = TRUE, ALLELE.color = "grey90") ``` ### Display variations on gene model. I think it's a good idea to figure out where are the variants by marking them on gene model. ```{r display variations on gene model, eval=NOT_CRAN, fig.height=4, fig.width=6, message=FALSE, warning=FALSE, paged.print=FALSE} displayVarOnGeneModel(hapSummary, gff, Chr = "scaffold_1", startPOS = 4300, endPOS = 7910, type = "pin", cex = 0.7, CDS_h = 0.05, fiveUTR_h = 0.02, threeUTR_h = 0.01) ``` ### `hapNet` calculation and visualization The `hapNet` could be generated from object of `hapSummary` class. The accession group information could be attached in this step. ```{r hapNet, eval=NOT_CRAN} hapNet <- get_hapNet(hapSummary, AccINFO = AccINFO, groupName = "Type") ``` Once we have the `hapNet` object, we can plot it with 'R'. ```{r, eval=NOT_CRAN, fig.height=6, fig.width=7, message=FALSE, warning=FALSE, paged.print=FALSE} # plot haploNet plotHapNet(hapNet, size = "freq", # circle size scale = "log2", # scale circle with 'log10(size + 1)' cex = 0.8, # size of hap symbol col.link = 2, # link colors link.width = 2, # link widths show.mutation = 2, # mutation types one of c(0,1,2,3) legend = c(-12.5, 7)) # legend position ``` ## Geography distribution of main haplotypes Now we get the haplotype result. There is a new question emerged: how did those main haplotypes distributed, are they related to geography? ```{r geo distribution, eval=NOT_CRAN, fig.height=4, fig.width=6, message=FALSE, warning=FALSE, paged.print=FALSE} # library(mapdata) # library(maptools) hapDistribution(hapResult, AccINFO = AccINFO, LON.col = "longitude", LAT.col = "latitude", hapNames = c("H001", "H002", "H003"), legend = TRUE) ``` ### Phenotype association analysis Finally, let's see which haplotype has superiority at particular area by interact with phynotype. Here are two options, merged or separated, to organized the heatmap of p-values and violin plot. The figure as an object of `ggplot2`, which means user could add/modified figure elements with `ggplot2`. Here is an example for merged arrangement: ```{r hapVsPheno merged, eval=NOT_CRAN, fig.height=4, fig.width=10, message=FALSE, warning=FALSE, paged.print=FALSE} results <-hapVsPheno(hapResult, hapPrefix = "H", title = "This is title", mergeFigs = TRUE, pheno = pheno, phenoName = "GrainWeight.2021", minAcc = 3) plot(results$figs) ``` **An example for separated plot:** ```{r hapVsPheno separated, eval=NOT_CRAN, fig.height=4, fig.width=6, message=FALSE, warning=FALSE, paged.print=FALSE} results <- hapVsPheno(hap = hapResult, hapPrefix = "H", title = "This is title", pheno = pheno, phenoName = "GrainWeight.2021", minAcc = 3, mergeFigs = FALSE) plot(results$fig_pvalue) plot(results$fig_Violin) ``` ### Association analysis of multi-phenotypes once a time I believe the function of `hapVsPhenos()` will be useful there are a lot of phenotype need to be associated with haplotype results. Note that: the pheno name will be added between the file name and surfix. ```{r eval=FALSE} hapVsPhenos(hapResult, pheno, outPutSingleFile = TRUE, hapPrefix = "H", title = "Seita.0G000000", file = "mypheno.tiff", width = 12, height = 8, res = 300) ```