Documentation for local_pcangsd
Installation
The easiest way to install local_pcangsd and its dependencies is with the conda environment defined in the yaml recipe (conda_env.yaml):
conda env create -f conda_env.yaml
conda activate local_pcangsd
pip install git+https://github.com/alxsimon/local_pcangsd.git
Quickstart
1. Transform your genotype likelihood file with local_pcangsd.beagle_to_zarr().
2. Open your converted dataset using local_pcangsd.load_dataset().
3. Create windows with local_pcangsd.window().
4. Run PCAngsd on each window with local_pcangsd.pca_window().
5. Open the PCA dataset using local_pcangsd.load_dataset().
6. Convert to the python lostruct format using local_pcangsd.to_lostruct().
7. Use the data any way you want.
Please see this example for more details.
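The steps above can be chained as in the following minimal sketch. It only uses the functions and arguments documented in the reference section of this page. `run_quickstart` is a hypothetical helper, not part of local_pcangsd, and the paths, window type and window size are placeholder assumptions. The imported module is passed in as `lp`, so the sketch makes no assumption about how it is imported.

```python
def run_quickstart(lp, beagle_path, gl_store, pca_store, window_size=10_000):
    """Chain the quickstart steps; `lp` is the imported local_pcangsd module.

    Hypothetical helper for illustration only. The paths and the window
    settings are placeholders chosen by the caller.
    """
    # 1. Convert the genotype likelihood file to a Zarr store on disk.
    lp.beagle_to_zarr(beagle_path, gl_store)
    # 2. Open the converted dataset.
    ds = lp.load_dataset(gl_store)
    # 3. Create windows; with type="position" the size is expressed in bp.
    ds = lp.window(ds, type="position", size=window_size)
    # 4. Run PCAngsd on each window and write the results to pca_store.
    lp.pca_window(ds, pca_store)
    # 5. Open the PCA results.
    ds_pca = lp.load_dataset(pca_store)
    # 6. Convert to the lostruct format and hand it back to the caller.
    return lp.to_lostruct(ds_pca)
```

With the package installed, this could be called as `run_quickstart(local_pcangsd, "input.beagle.gz", "gl.zarr", "pca.zarr")`, where the three paths are placeholders.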
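API
Summary
| local_pcangsd.beagle_to_zarr() | Converts an ANGSD genotype likelihood dataset to a Zarr array on disk. |
| local_pcangsd.load_dataset() | Wrapper around xarray.open_zarr |
| local_pcangsd.window() | Create windows on the dataset. |
| local_pcangsd.get_window_center() | Estimate the center position of each window, useful for plotting. |
| local_pcangsd.pca_window() | Run PCAngsd on each window. |
| local_pcangsd.to_lostruct() | Converts the local_pcangsd result to lostruct format |
| local_pcangsd.pcangsd_merged_windows() | Compute PCAngsd on merged windows of interest. |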
Reference documentation
Details on functions and their arguments.
- local_pcangsd.beagle_to_zarr(input, store, chunksize=10000)
Converts an ANGSD genotype likelihood dataset to a Zarr array on disk.
- local_pcangsd.get_window_center(ds)
Estimate the center position of each window, useful for plotting.
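As an illustration of what a window center could be, the sketch below takes the midpoint between the first and last variant positions of each window. This is an assumption made for illustration, not necessarily how get_window_center computes it. It also assumes that window_start and window_stop are indexes into the variants dimension with an exclusive stop, as in sgkit windowing. `window_centers` is a hypothetical helper.

```python
def window_centers(variant_position, window_start, window_stop):
    """Midpoint (in bp) between the first and last variant of each window.

    Assumes window_start/window_stop are variant indexes with an exclusive
    stop. Illustrative only; not the local_pcangsd implementation.
    """
    centers = []
    for start, stop in zip(window_start, window_stop):
        first = variant_position[start]
        last = variant_position[stop - 1]  # stop is exclusive
        centers.append((first + last) / 2)
    return centers


positions = [100, 250, 400, 1000, 1200, 1900]
print(window_centers(positions, window_start=[0, 3], window_stop=[3, 6]))
# [250.0, 1450.0]
```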
- local_pcangsd.load_dataset(store, **kwargs)
Wrapper around xarray.open_zarr.
- Parameters
store (str) – path to the Zarr store.
**kwargs – keyword arguments passed to xarray.open_zarr.
- Returns
the opened dataset
- Return type
- local_pcangsd.pca_window(ds, store, output_chunks={'variants': 10000, 'windows': 100}, k=None, tmp_folder='/tmp/tmp_local_pcangsd', scheduler='threads', num_workers=None, clean_tmp=True, overwrite=True, min_maf=0.05, maf_iter=200, maf_tole=0.0001, n_eig=0, iter=100, tole=1e-05, pcangsd_threads=1)
Run PCAngsd on each window.
- Parameters
ds (Dataset) – local_pcangsd dataset containing genotype likelihoods and windows.
store (str) – path to store the local_pcangsd results.
output_chunks (dict) – size of chunks for the xarray output.
k (Optional[int]) – number of PCs to retain in the output. By default, all are kept.
tmp_folder (str) – folder used to store temporary results. It is created if it does not exist. ‘/tmp/tmp_local_pcangsd’ by default.
scheduler (str) – dask single-machine scheduler to use: ‘threads’, ‘processes’ or ‘synchronous’.
num_workers (Optional[int]) – number of dask workers to use. Be careful to set this argument and pcangsd_threads consistently.
clean_tmp (bool) – whether the temporary folder should be emptied. Keeping it (False) can be useful for debugging.
overwrite (bool) – whether temporary files should be overwritten. Defaults to True.
min_maf (float) – pcangsd minMaf argument. Minimum minor allele frequency of sites to consider.
maf_iter (int) – pcangsd maf_iter argument.
maf_tole (float) – pcangsd maf_tole argument.
n_eig (int) – pcangsd n_eig argument.
iter (int) – pcangsd iter argument.
tole (float) – pcangsd tole argument.
pcangsd_threads (int) – pcangsd threads argument. Be careful to set this argument and num_workers consistently.
- Returns
Path to the created Zarr store containing the PCAngsd results for each window.
- Return type
- Raises
Exception – if the window variables do not exist in ds.
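The warning about num_workers and pcangsd_threads presumably reflects that each dask worker can run PCAngsd with its own threads, so the two values multiply. One simple way to choose them, sketched below, is to keep their product within the number of available cores. `split_cores` is a hypothetical helper and the policy is an assumption, not a recommendation from the package.

```python
import os


def split_cores(pcangsd_threads=1, total_cores=None):
    """Return (num_workers, pcangsd_threads) whose product fits total_cores."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1  # cpu_count() may return None
    if pcangsd_threads < 1:
        raise ValueError("pcangsd_threads must be at least 1")
    # Never ask for more threads per worker than there are cores.
    threads = min(pcangsd_threads, total_cores)
    # Give the remaining parallelism to dask workers, with at least one.
    num_workers = max(1, total_cores // threads)
    return num_workers, threads


num_workers, threads = split_cores(pcangsd_threads=2, total_cores=8)
print(num_workers, threads)  # 4 2
```

The two values could then be passed on as `pca_window(ds, store, num_workers=num_workers, pcangsd_threads=threads)`.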
- local_pcangsd.pcangsd_merged_windows(ds, windows_idx, k=None, min_maf=0.05, maf_iter=200, maf_tole=0.0001, n_eig=0, iter=100, tole=1e-05, pcangsd_threads=1)
Compute PCAngsd on merged windows of interest.
- Parameters
ds (Dataset) – local_pcangsd dataset containing genotype likelihoods and windows.
windows_idx (array) – indexes of the windows to merge.
k (Optional[int]) – number of PCs to retain in the output. By default, all are kept.
min_maf (float) – pcangsd minMaf argument.
maf_iter (int) – pcangsd maf_iter argument.
maf_tole (float) – pcangsd maf_tole argument.
n_eig (int) – pcangsd n_eig argument.
iter (int) – pcangsd iter argument.
tole (float) – pcangsd tole argument.
pcangsd_threads (int) – pcangsd threads argument.
- Returns
(covariance matrix, total variance, eigenvalues, eigenvectors)
- Return type
- Raises
Exception – if the dataset does not have window variables.
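windows_idx lists the windows to merge. As a sketch of how such a list could be built, the hypothetical helper below keeps the indexes of windows on one contig whose center falls inside a region of interest. The inputs are plain sequences here. In practice they might come from the window_contig variable and from get_window_center(ds), which is an assumption about how those values would be used.

```python
def windows_in_region(window_contig, window_center, contig, start, stop):
    """Indexes of windows on `contig` with start <= center <= stop."""
    return [
        i
        for i, (c, center) in enumerate(zip(window_contig, window_center))
        if c == contig and start <= center <= stop
    ]


contigs = [0, 0, 0, 1, 1]
centers = [5_000, 15_000, 25_000, 5_000, 15_000]
idx = windows_in_region(contigs, centers, contig=0, start=10_000, stop=30_000)
print(idx)  # [1, 2]
```

The resulting indexes, converted to an array if needed, could then be passed as windows_idx, and the returned tuple unpacked into the covariance matrix, total variance, eigenvalues and eigenvectors.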
- local_pcangsd.to_lostruct(ds_pca)
Converts the local_pcangsd result to lostruct format.
- Parameters
ds_pca (Dataset) – local_pcangsd PCA result dataset.
- Returns
array in the lostruct format.
- Return type
- local_pcangsd.window(ds, type, size, min_variant_number=100)
Create windows on the dataset.
Wrapper around sgkit.window_by_[…]. Size is either in bp or in number of variants, depending on type. Drops empty windows.
- Parameters
ds (Dataset) – input dataset.
type (str) – ‘position’ or ‘variant’. Create windows either by position or by number of variants.
size (int) – size of each window, either in bp or in number of variants depending on type.
min_variant_number (int) – minimal number of variants required to keep a window. Windows with fewer than min_variant_number variants are discarded.
- Returns
ds input with appended windowing variables. A new dimension, windows, is created. Created variables:
window_contig
window_start
window_stop
window_used: boolean indicating whether the window is used following min_maf filtering.
- Return type
- Raises
ValueError – if type is not ‘position’ or ‘variant’.
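To make the ‘variant’ windowing and the min_variant_number filter concrete, here is a small pure-Python sketch. It is not the sgkit or local_pcangsd implementation: `variant_windows` is a hypothetical helper that cuts a run of variants into consecutive windows of `size` variants and discards windows with fewer than min_variant_number variants. It ignores contigs for brevity.

```python
def variant_windows(n_variants, size, min_variant_number=100):
    """Return (start, stop) index pairs, stop exclusive, for kept windows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    windows = []
    for start in range(0, n_variants, size):
        stop = min(start + size, n_variants)  # last window may be shorter
        if stop - start >= min_variant_number:
            windows.append((start, stop))
    return windows


print(variant_windows(n_variants=1050, size=500, min_variant_number=100))
# [(0, 500), (500, 1000)]
```

In this example the trailing window holds only 50 variants, so it is dropped.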