Documentation for local_pcangsd

Installation

The easiest way to install local_pcangsd and its dependencies is to create a conda environment from the YAML recipe (conda_env.yaml), then install the package with pip:

conda env create -f conda_env.yaml
conda activate local_pcangsd
pip install git+https://github.com/alxsimon/local_pcangsd.git

Quickstart

  1. Transform your genotype likelihood file with local_pcangsd.beagle_to_zarr()

  2. Open your converted dataset using local_pcangsd.load_dataset()

  3. Create windows with local_pcangsd.window()

  4. Run PCAngsd on each window with local_pcangsd.pca_window()

  5. Open the PCA dataset using local_pcangsd.load_dataset()

  6. Convert to the python lostruct format using local_pcangsd.to_lostruct()

  7. Use the data any way you want.

Please see this example for more details.
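Independently of the linked example, the seven steps above can be sketched as a single function. This is a minimal sketch based only on the signatures documented below. The function name run_local_pca, its arguments, and the window size of 10,000 variants are hypothetical choices for illustration, and all other arguments are left at their defaults.

```python
def run_local_pca(beagle_path, gl_store, pca_store, window_size=10_000):
    """Chain the quickstart steps and return the lostruct-format array.

    All argument names and the default window size are illustrative
    assumptions, not part of local_pcangsd itself.
    """
    # Imported inside the function so the sketch can be defined even
    # where the package is not installed.
    import local_pcangsd as lp

    # 1. Convert the ANGSD beagle file to a Zarr store on disk.
    lp.beagle_to_zarr(beagle_path, gl_store)

    # 2. Open the converted dataset.
    ds = lp.load_dataset(gl_store)

    # 3. Create windows, here by number of variants.
    ds = lp.window(ds, type="variant", size=window_size)

    # 4. Run PCAngsd on each window; the path of the result store is returned.
    result_store = lp.pca_window(ds, pca_store)

    # 5. Open the PCA dataset.
    ds_pca = lp.load_dataset(result_store)

    # 6. Convert to the lostruct format; step 7 is up to the caller.
    return lp.to_lostruct(ds_pca)
```

A call would look like run_local_pca("input.beagle.gz", "gl.zarr", "pca.zarr"), where the three paths are placeholders.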

API

Summary

beagle_to_zarr(input, store[, chunksize])

Converts an ANGSD genotype likelihood dataset to a Zarr array on disk.

load_dataset(store, **kwargs)

Wrapper around xarray.open_zarr.

window(ds, type, size[, min_variant_number])

Create windows on the dataset.

get_window_center(ds)

Estimate the center position of each window, useful for plotting.

pca_window(ds, store[, output_chunks, k, ...])

Run PCAngsd on each window.

to_lostruct(ds_pca)

Converts the local_pcangsd result to lostruct format.

pcangsd_merged_windows(ds, windows_idx[, k, ...])

Compute PCAngsd on merged windows of interest.

Reference documentation

Details on functions and their arguments.

local_pcangsd.beagle_to_zarr(input, store, chunksize=10000)

Converts an ANGSD genotype likelihood dataset to a Zarr array on disk.

Parameters
  • input (str) – path to a genotype likelihood file in beagle format produced by ANGSD.

  • store (str) – path of the Zarr store written to disk, e.g. “output.zarr”.

  • chunksize (int) – size of each chunk in the variant dimension.

Return type

None
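To make the input and the chunksize argument concrete, here is a minimal sketch of the ANGSD beagle layout: a header line, then one line per variant with a marker of the form contig_position, two allele columns, and three genotype likelihoods per sample. The functions read_beagle and chunk_bounds and the EXAMPLE text are hypothetical illustrations written with the standard library only. They are not how beagle_to_zarr is implemented.

```python
import io

# Tiny in-memory beagle file: 2 samples, 3 variants (made-up values).
EXAMPLE = (
    "marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n"
    "chr1_100\t0\t1\t0.9\t0.1\t0.0\t0.3\t0.3\t0.4\n"
    "chr1_250\t2\t3\t0.2\t0.5\t0.3\t0.1\t0.8\t0.1\n"
    "scaffold_12_2045\t0\t2\t0.6\t0.3\t0.1\t0.0\t0.5\t0.5\n"
)


def read_beagle(handle):
    """Parse beagle text into (n_samples, [(contig, position, likelihoods)])."""
    header = handle.readline().split()
    # Three header columns describe the variant; each sample then has three.
    n_samples = (len(header) - 3) // 3
    variants = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        # Split on the last underscore, since contig names may contain "_".
        contig, position = fields[0].rsplit("_", 1)
        values = [float(x) for x in fields[3:]]
        per_sample = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
        variants.append((contig, int(position), per_sample))
    return n_samples, variants


def chunk_bounds(n_variants, chunksize):
    """Half-open (start, stop) ranges that split the variant dimension."""
    return [(start, min(start + chunksize, n_variants))
            for start in range(0, n_variants, chunksize)]


n_samples, variants = read_beagle(io.StringIO(EXAMPLE))
```

With the default chunksize of 10000, a dataset of 25,000 variants would be split into three chunks along the variant dimension, the last one holding 5,000 variants.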

local_pcangsd.get_window_center(ds)

Estimate the center position of each window, useful for plotting.

Parameters

ds (Dataset) – Dataset with windows

Returns

Array of window center positions

Return type

numpy.ndarray

Raises

Exception – if the Dataset does not contain windows
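The documentation does not say how the center is estimated. One plausible definition, shown here purely as an assumption, is the midpoint between the positions of the first and last variant in each window. The sketch below uses plain lists and a hypothetical helper named window_centers, and treats window_start and window_stop as variant indices with an exclusive stop.

```python
def window_centers(positions, window_start, window_stop):
    """Midpoint between first and last variant position of each window.

    positions: variant positions in bp, sorted within a contig.
    window_start / window_stop: variant indices, stop exclusive (assumed).
    """
    centers = []
    for start, stop in zip(window_start, window_stop):
        first = positions[start]
        last = positions[stop - 1]
        centers.append((first + last) / 2)
    return centers


centers = window_centers([5, 12, 18, 40, 41, 95], [0, 3], [3, 5])
```

Centers computed this way can serve as x coordinates when plotting a per-window statistic along the genome.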

local_pcangsd.load_dataset(store, **kwargs)

Wrapper around xarray.open_zarr.

Parameters
  • store (str) – path to zarr store

  • **kwargs – keyword arguments passed to xarray.open_zarr

Returns

the opened dataset

Return type

xarray.Dataset

local_pcangsd.pca_window(ds, store, output_chunks={'variants': 10000, 'windows': 100}, k=None, tmp_folder='/tmp/tmp_local_pcangsd', scheduler='threads', num_workers=None, clean_tmp=True, overwrite=True, min_maf=0.05, maf_iter=200, maf_tole=0.0001, n_eig=0, iter=100, tole=1e-05, pcangsd_threads=1)

Run PCAngsd on each window.

Parameters
  • ds (Dataset) – local_pcangsd dataset containing genotype likelihoods and windows.

  • store (str) – path to store the local_pcangsd results.

  • output_chunks (dict) – size of chunks for the xarray output.

  • k (Optional[int]) – number of PCs to retain in the output. By default, all PCs are kept.

  • tmp_folder (str) – folder to use to store temporary results. Will be created if it does not exist. ‘/tmp/tmp_local_pcangsd’ by default.

  • scheduler (str) – dask single-machine scheduler to use. ‘threads’, ‘processes’ or ‘synchronous’.

  • num_workers (Optional[int]) – dask number of workers to use. Be careful to adapt pcangsd_threads and this argument accordingly.

  • clean_tmp (bool) – should the temporary folder be emptied? Setting this to False is useful for debugging.

  • overwrite (bool) – should tmp files be overwritten? Defaults to True.

  • min_maf (float) – pcangsd minMaf argument: minimum minor allele frequency for a site to be considered.

  • maf_iter (int) – pcangsd maf_iter argument.

  • maf_tole (float) – pcangsd maf_tole argument.

  • n_eig (int) – pcangsd n_eig argument.

  • iter (int) – pcangsd iter argument.

  • tole (float) – pcangsd tole argument.

  • pcangsd_threads (int) – pcangsd threads argument. Be careful to adapt num_workers and this argument accordingly.

Returns

Path to the created Zarr store containing the PCAngsd result of each window.

Return type

str

Raises

Exception – if window variables do not exist in ds.
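The advice to adapt num_workers and pcangsd_threads to each other can be read as follows: if every dask worker runs PCAngsd with pcangsd_threads threads, the product of the two should not exceed the number of available cores. That reading is an assumption, and the helper threads_per_worker below is a hypothetical sketch of it, not part of local_pcangsd.

```python
import os


def threads_per_worker(num_workers, total_cores=None):
    """Largest pcangsd_threads such that num_workers * threads <= cores.

    total_cores defaults to the core count reported by the OS.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    cores = total_cores or os.cpu_count() or 1
    # Never go below one thread, even if there are more workers than cores.
    return max(1, cores // num_workers)


example = threads_per_worker(4, total_cores=16)
```

On a 16-core machine this suggests, for instance, num_workers=4 with pcangsd_threads=4, or num_workers=16 with pcangsd_threads=1.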

local_pcangsd.pcangsd_merged_windows(ds, windows_idx, k=None, min_maf=0.05, maf_iter=200, maf_tole=0.0001, n_eig=0, iter=100, tole=1e-05, pcangsd_threads=1)

Compute PCAngsd on merged windows of interest.

Parameters
  • ds (Dataset) – local_pcangsd dataset containing genotype likelihoods and windows.

  • windows_idx (array) – indexes of windows to merge.

  • k (Optional[int]) – number of PCs to retain in the output. By default, all PCs are kept.

  • min_maf (float) – pcangsd minMaf.

  • maf_iter (int) – pcangsd maf_iter argument.

  • maf_tole (float) – pcangsd maf_tole argument.

  • n_eig (int) – pcangsd n_eig argument.

  • iter (int) – pcangsd iter argument.

  • tole (float) – pcangsd tole argument.

  • pcangsd_threads (int) – pcangsd threads argument.

Returns

(covariance matrix, total variance, eigenvalues, eigenvectors)

Return type

tuple

Raises

Exception – if the dataset does not have window variables.
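Conceptually, merging windows means pooling the variants of the selected windows before running a single PCA. The sketch below illustrates that pooling step on plain lists, assuming window_start and window_stop hold variant indices with an exclusive stop. The helper merged_variant_indices is hypothetical and does not reproduce the library's internals.

```python
def merged_variant_indices(window_start, window_stop, windows_idx):
    """Collect the variant indices covered by the selected windows."""
    indices = []
    # Deduplicate and sort so each window contributes once, in genome order.
    for w in sorted(set(windows_idx)):
        indices.extend(range(window_start[w], window_stop[w]))
    return indices


pooled = merged_variant_indices([0, 3, 5], [3, 5, 9], [2, 0])
```

Here windows 0 and 2 are merged while window 1 is left out, so the pooled set skips variants 3 and 4.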

local_pcangsd.to_lostruct(ds_pca)

Converts the local_pcangsd result to lostruct format.

Parameters

ds_pca (Dataset) – local_pcangsd PCA result dataset.

Returns

array in the lostruct format.

Return type

numpy.ndarray

local_pcangsd.window(ds, type, size, min_variant_number=100)

Create windows on the dataset.

Wrapper around sgkit.window_by_[…]. The size is given either in bp or in number of variants, depending on type. Empty windows are dropped.

Parameters
  • ds (Dataset) – input dataset.

  • type (str) – ‘position’ or ‘variant’. Windows are defined either by genomic position or by number of variants.

  • size (int) – size of each window, either in bp or number of variants depending on type.

  • min_variant_number (int) – minimum number of variants required to keep a window. Windows with fewer than min_variant_number variants are discarded.

Returns

The input ds with windowing variables appended.

A new dimension, windows, is created. Created variables:

  • window_contig

  • window_start

  • window_stop

  • window_used: boolean indicating if window is used following min_maf filtering.

Return type

xarray.Dataset

Raises

ValueError – if type is not ‘position’ or ‘variant’.
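The difference between the two window types, and the effect of min_variant_number, can be illustrated on a single contig with plain Python. This is a sketch under stated assumptions: position bins are anchored at position 0 and positions are sorted, which may differ from what sgkit does. The helper make_windows is hypothetical and returns half-open variant-index ranges.

```python
from itertools import groupby


def make_windows(positions, type, size, min_variant_number=100):
    """Return (start, stop) variant-index pairs for one contig, stop exclusive.

    The argument is named `type` only to mirror the documented signature.
    """
    if type == "variant":
        # Fixed number of variants per window; the last one may be shorter.
        bounds = [(start, min(start + size, len(positions)))
                  for start in range(0, len(positions), size)]
    elif type == "position":
        # Fixed width in bp; bins without variants never appear, which
        # corresponds to dropping empty windows.
        bounds = []
        start = 0
        for _, group in groupby(positions, key=lambda p: p // size):
            count = len(list(group))
            bounds.append((start, start + count))
            start += count
    else:
        raise ValueError("type must be 'position' or 'variant'")
    # Discard windows holding fewer than min_variant_number variants.
    return [(a, b) for a, b in bounds if b - a >= min_variant_number]


positions = [5, 12, 18, 40, 41, 95]
by_position = make_windows(positions, "position", 20, min_variant_number=2)
by_variant = make_windows(positions, "variant", 4, min_variant_number=3)
```

With 20 bp windows the variant at position 95 sits alone in its bin and is discarded, while with 4-variant windows the trailing window of two variants is the one that falls below the threshold.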
