Title: | Computation of Ancestry Scores with Mixed Families and Unrelated Individuals |
---|---|
Description: | We provide several algorithms to compute the genotype ancestry scores (such as eigenvector projections) in the case where highly correlated individuals are involved. |
Authors: | Yi-Hui Zhou |
Maintainer: | Yi-Hui Zhou <[email protected]> |
License: | GPL-2 |
Version: | 1.0 |
Built: | 2025-02-16 04:32:20 UTC |
Source: | https://github.com/cran/PCFAM |
This package computes ancestry scores from genotype data and is robust to the presence of close-degree family members. It implements four main novel algorithms: (i) geometric rotation (within-family data orthogonalization); (ii) matrix substitution, based on the decomposition of a target family-orthogonalized covariance matrix; (iii) covariance-preserving whitening, which retains covariances between unrelated pairs while orthogonalizing family members (note: the function perfectwhiten generates a new dataset that keeps the same covariance structure as the original set); (iv) family averaging, which uses family-averaged data to obtain loadings for projection of family members.
Package: | PCFAM |
Type: | Package |
Version: | 1.0 |
Date: | 2016-10-11 |
License: | GPL 2 |
LazyLoad: | yes |
Yi-Hui Zhou
Maintainer: Yi-Hui Zhou <[email protected]>
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416
X <- matrix(rbinom(1000 * 20, 2, 0.4), 1000, 20)
X[, 1] <- X[, 2] * 0.9  # individual 1 becomes a scaled copy of individual 2
X <- rowscale(X)
Xresid <- residualize(X)
corXresid <- cor(Xresid)
myfam <- findfamilies(corXresid, 0.1)
K <- 3
myms.pca <- ms.pca(X, corXresid, 0.1, K)
familyave.result <- familyave(X, myfam, top = K)
This function centers each column of the data matrix.
colcenter(X)
X |
input data matrix |
Returns the data matrix with each column centered.
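As a rough illustration of what column centering does, here is a base-R sketch. The name `colcenter_sketch` is hypothetical, and this is not the package's own source.

```r
# Hypothetical base-R equivalent of column centering (not PCFAM's own code).
colcenter_sketch <- function(X) {
  # Subtract each column's mean from every entry in that column.
  sweep(X, 2, colMeans(X), FUN = "-")
}

set.seed(1)
X <- matrix(rbinom(50 * 4, 2, 0.4), nrow = 50, ncol = 4)
Xc <- colcenter_sketch(X)
colMeans(Xc)  # all numerically zero
```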
Yi-Hui Zhou
Computation of ancestry scores with mixed families and unrelated individuals. Yi-Hui Zhou, J.S. Marron, Fred Wright, arXiv:1606.08416.
Obtain a sample covariance matrix
cov.function(data.matrix)
data.matrix |
Input mxn data matrix |
return the nxn sample covariance matrix
Yi-Hui Zhou
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
X <- matrix(rbinom(1000 * 20, 2, 0.4), 1000, 20)
cov.X <- cov.function(X)
This function implements the family-averaging algorithm, with loadings based on the combined data from singletons and family averages, then projected to all.
familyave(Xall,myfam, top = 5)
Xall |
The original input genotype dataset |
myfam |
The identified family IDs. Each singleton forms his/her own family. |
top |
The number of ancestry scores desired. |
The function averages the genotype information within each family, re-inflates the average so that it has appropriate variability, and treats it as a 'singleton' for the purpose of loading calculation. Ancestry scores are then obtained by projecting all individuals onto these loadings.
Returns the top ancestry scores, obtained by combining family data with singletons.
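The averaging-then-projection idea can be sketched in base R as follows. This is only an illustration under stated assumptions: `familyave_sketch` and `fam_id` are hypothetical names, and the re-inflation step (rescaling every averaged column to unit standard deviation) is an assumption, since the exact rule used by familyave is not described here.

```r
# Hypothetical sketch of family averaging; not PCFAM's implementation.
familyave_sketch <- function(Xall, fam_id, top = 2) {
  fams <- unique(fam_id)
  # One column per family: the mean of its members' genotype vectors.
  Xave <- sapply(fams, function(f) rowMeans(Xall[, fam_id == f, drop = FALSE]))
  # Assumed re-inflation: rescale every averaged column to unit std. deviation.
  Xave <- scale(Xave, center = FALSE, scale = apply(Xave, 2, sd))
  # SNP loadings from the reduced data, then projection of all individuals.
  loadings <- svd(Xave, nu = top, nv = 0)$u
  crossprod(Xall, loadings)  # (individuals) x top matrix of scores
}

set.seed(3)
X <- matrix(rnorm(300 * 6), 300, 6)
X[, 2] <- 0.5 * X[, 1] + sqrt(0.75) * X[, 2]  # individuals 1 and 2 related
fam_id <- c(1, 1, 2, 3, 4, 5)                 # singletons get their own ID
scores <- familyave_sketch(X, fam_id, top = 2)
dim(scores)
```

Because the loadings come from one column per family, a large family contributes no more to the principal axes than a single unrelated individual does.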
Yi-Hui Zhou
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
X <- matrix(rbinom(1000 * 20, 2, 0.4), 1000, 20)
X[, 1] <- X[, 2] * 0.9  # individual 1 becomes a scaled copy of individual 2
X <- rowscale(X)
Xresid <- residualize(X)
corXresid <- cor(Xresid)
myfam <- findfamilies(corXresid, 0.1)
K <- 3
familyave.result <- familyave(X, myfam, top = K)
This function computes the covariance matrix faster than the base cov() function.
fastcov(X)
X |
input mxn data matrix |
Output nxn covariance matrix
The input data matrix has to be column scaled in advance.
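The requirement on the input suggests why a shortcut is possible: once the columns are centered, the covariance reduces to a single cross-product. The sketch below assumes that "column scaled" means at least column-centered and that the divisor is m - 1; `fastcov_sketch` is a hypothetical name, and the package's normalization may differ.

```r
set.seed(2)
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

# Center each column first, mirroring the stated requirement on the input.
Xc <- sweep(X, 2, colMeans(X), FUN = "-")

# One cross-product gives the n x n covariance (divisor m - 1 assumed here).
fastcov_sketch <- function(Xc) crossprod(Xc) / (nrow(Xc) - 1)

all.equal(fastcov_sketch(Xc), cov(X))
```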
Yi-Hui Zhou
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
This function searches for pairs of individuals with high kinship based on the genotype correlation matrix.
findfamilies(x, threshold = 0.4)
x |
The nxn correlation matrix of the input dataset. |
threshold |
This threshold is used to identify close-degree relatives. Recommended values are 0.4 to identify first-degree relatives, and 0.15 to identify first- and second-degree relatives. |
Output numerical family ID for each individual. Individuals with the same ID are judged to be family members.
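One simple way to turn thresholded correlations into family IDs is to merge every pair whose correlation exceeds the threshold, so that relatives linked through a shared member end up in the same family. The sketch below is an assumption about the general approach, not the package's code; `find_families_sketch` is a hypothetical name, and the ID encoding returned by findfamilies may differ.

```r
find_families_sketch <- function(cormat, threshold = 0.4) {
  n <- nrow(cormat)
  id <- seq_len(n)  # start with every individual in a family of their own
  pairs <- which(cormat > threshold & upper.tri(cormat), arr.ind = TRUE)
  for (k in seq_len(nrow(pairs))) {
    a <- id[pairs[k, 1]]
    b <- id[pairs[k, 2]]
    # Merge the two groups by relabelling the larger ID to the smaller one.
    if (a != b) id[id == max(a, b)] <- min(a, b)
  }
  id
}

cormat <- diag(4)
cormat[1, 2] <- cormat[2, 1] <- 0.5   # first-degree pair
cormat[2, 3] <- cormat[3, 2] <- 0.45  # links individual 3 to the same family
find_families_sketch(cormat, threshold = 0.4)
# 1 1 1 4
```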
Yi-Hui Zhou
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
This algorithm rotates scaled genotypes among family members so that they are mutually orthogonal.
gr.pca(data.input, index.family, myfam, weight, top, family.size, inflation)
data.input |
Input dataset, each row is for a genetic feature (SNP), each column is for individual. Data are typically number of minor alleles, possibly imputed. |
index.family |
Index vector to indicate the family id of each individual. |
myfam |
This value comes directly from the output of findfamilies(). |
weight |
Weight is 0 by default. This is a deprecated weight value that can be used to control the amount of rotation performed. A weight of zero performs full orthogonalization, while a weight of 1 keeps the data unchanged. |
top |
The number of eigenvectors to be used. |
family.size |
The number of members in each family. Used to determine rotation angles. |
inflation |
Inflation applied to the data values; 0 by default. Deprecated. |
data.new |
The new data matrix after the geometric rotation |
topPCs |
The top eigenvectors |
topEigenvalue |
The top eigenvalues. |
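To make "mutually orthogonal family members" concrete, the sketch below orthogonalizes the columns of a small family block. It uses symmetric (Loewdin) orthogonalization as a stand-in, which moves all members by a comparable amount; it is not claimed to be the rotation that gr.pca applies, and `orthogonalize_family` is a hypothetical name.

```r
# Hypothetical helper: make the columns of a family block mutually orthogonal
# while keeping each column's original length.
orthogonalize_family <- function(Xfam) {
  G <- crossprod(Xfam)                      # Gram matrix of the family members
  e <- eigen(G, symmetric = TRUE)
  k <- length(e$values)
  G_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values), nrow = k) %*% t(e$vectors)
  # Orthonormalize, then restore the original column norms.
  Xfam %*% G_inv_sqrt %*% diag(sqrt(diag(G)), nrow = k)
}

set.seed(4)
a <- rnorm(500)
b <- 0.5 * a + sqrt(0.75) * rnorm(500)      # correlated with a, like a relative
Xfam <- scale(cbind(a, b))                  # two related, scaled individuals
Xrot <- orthogonalize_family(Xfam)
round(crossprod(Xrot), 8)                   # off-diagonal entries are zero
```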
Yi-Hui Zhou
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
This function provides the matrix substitution algorithm. The main idea is to replace the high-valued entries of the covariance matrix that are produced by family members with a small value (e.g. the median covariance).
ms.pca(X, corXresid, threshold, top)
X |
The input data matrix |
corXresid |
The correlation of the genotypes after residualization for any evidence of larger scale ancestry. Used to identify close-degree family members in a manner robust to large-scale ancestry. |
threshold |
Covariance values of identified family members are set to the threshold. |
top |
The number of ancestry scores to obtain. |
eigenvector |
Eigenvectors after using the matrix substitution method |
myeigen |
The top eigenvalues and eigenvectors |
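The substitution idea can be sketched in a few lines of base R. This is a simplified illustration, not the package's code: `ms_pca_sketch` and `fam_pairs` are hypothetical names, the family pairs are supplied directly rather than detected from residualized correlations, and the median of the off-diagonal covariances is assumed as the replacement value.

```r
ms_pca_sketch <- function(X, fam_pairs, top = 2) {
  M <- cov(X)                                 # individuals x individuals
  eta <- median(M[upper.tri(M)])              # assumed replacement value
  M[fam_pairs] <- eta                         # substitute family entries
  M[fam_pairs[, 2:1, drop = FALSE]] <- eta    # keep the matrix symmetric
  e <- eigen(M, symmetric = TRUE)
  list(values = e$values[seq_len(top)],
       vectors = e$vectors[, seq_len(top), drop = FALSE])
}

set.seed(5)
X <- matrix(rnorm(400 * 6), 400, 6)
X[, 2] <- 0.5 * X[, 1] + sqrt(0.75) * X[, 2]  # individuals 1 and 2 related
res <- ms_pca_sketch(X, fam_pairs = cbind(1, 2), top = 2)
dim(res$vectors)
```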
Yi-Hui Zhou
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
This function computes a matrix square root without requiring an additional package, and is often faster than other code.
mysqrtm(a, symmetric = F)
a |
The input matrix |
symmetric |
Default=FALSE. This argument indicates whether the input matrix is symmetric. |
Matrix B is said to be a square root of A if the matrix product BB is equal to A.
Returns the square root matrix B.
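For a symmetric positive semi-definite matrix, one standard way to obtain such a B is through the eigendecomposition. The sketch below assumes a symmetric input and is not mysqrtm's source; `sqrtm_sketch` is a hypothetical name.

```r
sqrtm_sketch <- function(A) {
  e <- eigen(A, symmetric = TRUE)
  # Clip tiny negative eigenvalues caused by rounding before taking roots.
  vals <- pmax(e$values, 0)
  e$vectors %*% diag(sqrt(vals), nrow = length(vals)) %*% t(e$vectors)
}

A <- matrix(c(4, 1, 1, 3), 2, 2)
B <- sqrtm_sketch(A)
all.equal(B %*% B, A)
```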
This algorithm generates a new scaled 'genotype' dataset which keeps the same covariance structure as the original data, except that family members have been made orthogonal to each other, and singletons are unchanged.
perfectwhiten(Xun, Xfam, delta = 3e-04, threshold = 0.35, eta = NULL, addfuzz = F)
Xun |
A matrix of (possibly scaled) genotypes, (number of SNPs)*(number of singletons) |
Xfam |
A matrix of (possibly scaled) genotypes, (number of SNPs)*(number of individuals belonging to families) |
delta |
A slight offset used to ensure that the target covariance matrix is of full rank |
threshold |
The correlation threshold used to determine pairs of relatives. The value should be below the expected correlation for the degree of relatedness to be captured. For example, 0.35 captures first-degree relatives (expected correlation 0.5), and 0.15 captures first- and second-degree relatives (expected correlation for second-degree relatives is 0.25). |
eta |
This argument is the replacement value used for matrix substitution. The default is NULL, resulting in substitution by the median. |
addfuzz |
The default is FALSE. Deprecated. |
Xplusscaled |
The row-scaled full genotype data, including both singletons and family members |
Y |
The (scaled) genotype matrix after whitening, which should have a covariance matrix very close to Mtarget. Column means are zero. |
Ynotcolcentered |
The same as Y, but with column means matching those of Xplusscaled |
M |
The covariance matrix of the full data |
Mtilde |
The covariance matrix after matrix substitution of all family pairs identified with correlations exceeding eta |
whichbig |
The set of indexes of M that have correlation exceeding threshold |
covY |
The covariance matrix of Y, useful to compare to M or to Mtarget |
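The relationship between Y and the target covariance can be illustrated with a whiten-then-recolour transform: multiplying by M^(-1/2) removes the original covariance, and multiplying by Mtilde^(1/2) imposes the target. The sketch below shows only this covariance identity. It assumes both matrices are positive definite, omits the delta offset, and does not reproduce how perfectwhiten leaves singletons unchanged; `sym_power` is a hypothetical helper.

```r
# Hypothetical helper: power of a symmetric positive definite matrix.
sym_power <- function(A, p) {
  e <- eigen(A, symmetric = TRUE)
  e$vectors %*% diag(e$values^p, nrow = length(e$values)) %*% t(e$vectors)
}

set.seed(6)
X <- matrix(rnorm(500 * 5), 500, 5)
X[, 2] <- 0.5 * X[, 1] + sqrt(0.75) * X[, 2]   # individuals 1 and 2 related
Xc <- sweep(X, 2, colMeans(X), FUN = "-")      # column-centered data

M <- cov(Xc)                                   # covariance of the full data
Mtilde <- M
# Substitute the family pair's covariance with the median off-diagonal value.
Mtilde[1, 2] <- Mtilde[2, 1] <- median(M[upper.tri(M)])

# Whiten with M^(-1/2), then re-colour with Mtilde^(1/2).
Y <- Xc %*% sym_power(M, -0.5) %*% sym_power(Mtilde, 0.5)
all.equal(cov(Y), Mtilde)
```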
Yi-Hui Zhou, Fred A. Wright
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
library(PCFAM)
X <- matrix(rbinom(1000 * 20, 2, 0.4), 1000, 20)
X[, 1] <- X[, 2] * 0.9  # individual 1 becomes a scaled copy of individual 2
X <- rowscale(X)
Xresid <- residualize(X)
corXresid <- cor(Xresid)
myfam <- findfamilies(corXresid, 0.1)
K <- 3
perfect.result <- perfectwhiten(X[, which(myfam == 0)], X[, which(myfam == 1)])
This function performs a simple residualization of a row-scaled genotype dataset, removing large-scale population stratification. The output is a residualized dataset appropriate for computing correlations, so that family members can be easily identified. The function assumes X is row-scaled.
residualize(X)
X |
The original input genotype dataset |
This function pre-treats the data before the findfamilies function is applied.
Outputs the new row-scaled genotype matrix after residualization
Yi-Hui Zhou
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
This function identifies the rows and columns of elements in a matrix, e.g. the family members identified based on the correlation matrix.
rowcol(I, J, elements)
I |
The number of rows of the matrix (scalar) |
J |
The number of columns of the matrix (scalar) |
elements |
A vector of matrix element indexes |
whichrow |
The rows of elements in the matrix |
whichcol |
The columns of elements in the matrix |
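The index arithmetic involved can be sketched as follows, assuming the element indexes are column-major linear indexes, as produced by R's which() on a matrix. `rowcol_sketch` is a hypothetical name and is not the package's code.

```r
rowcol_sketch <- function(I, J, elements) {
  stopifnot(all(elements >= 1), all(elements <= I * J))
  # R stores matrices column by column, so rows cycle fastest.
  list(whichrow = (elements - 1) %% I + 1,
       whichcol = (elements - 1) %/% I + 1)
}

rowcol_sketch(4, 3, c(1, 5, 12))
# whichrow: 1 1 4; whichcol: 1 2 3
```

Base R's `arrayInd()` performs the same conversion and can be used to cross-check the result.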
Yi-Hui Zhou, Fred A. Wright
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.
X <- matrix(rbinom(1000 * 20, 2, 0.4), 1000, 20)
X[, 1] <- X[, 2] * 0.9
X <- rowscale(X)
This function scales the input matrix so that each row mean is 0 and each row (sample) variance is 1.
rowscale(X)
X |
input data matrix |
Output the row-scaled matrix.
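A base-R sketch of row scaling is shown below; `rowscale_sketch` is a hypothetical name, and this is not the package's source. Note that a row with zero variance (a monomorphic SNP) would produce NaN values in this sketch, so such rows would need to be filtered out beforehand.

```r
rowscale_sketch <- function(X) {
  # scale() works on columns, so transpose, standardize, and transpose back.
  t(scale(t(X)))
}

set.seed(7)
X <- matrix(rbinom(100 * 30, 2, 0.4), nrow = 100, ncol = 30)
Xs <- rowscale_sketch(X)
range(rowMeans(Xs))       # both ends numerically zero
range(apply(Xs, 1, var))  # both ends numerically one
```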
Yi-Hui Zhou, Fred A. Wright
Computation of ancestry scores with mixed families and unrelated individuals. arXiv:1606.08416.