noisets.noisettes
NoisET* NOIse sampling learning & Expansion detection of T-cell receptors using Bayesian inference.
High-throughput sequencing of T- and B-cell receptors makes it possible to track immune
repertoires across time, in different tissues, in acute and chronic diseases or in healthy individuals. However
quantitative comparison between repertoires is confounded by variability in the read count of each receptor
clonotype due to sampling, library preparation, and expression noise. We present an easy-to-use python
package NoisET that implements and generalizes a previously developed Bayesian method in Puelma Touzel et al, 2020. It can be used
to learn experimental noise models for repertoire sequencing from replicates, and to detect responding
clones following a stimulus. The package was tested on different repertoire sequencing technologies and
datasets. The NoisET package is described here (https://arxiv.org/abs/2102.03568).
* NoisET should be pronounced as "noisettes" (i.e. hazelnuts in French).
Functions library for NoisET - construction of noisettes package
Copyright (C) 2021 Meriem Bensouda Koraichi.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see https://www.gnu.org/licenses/.
Installation
Python 3
NoisET is Python 3.6 software. It is available on PyPI and can be installed through pip:
$ pip install noisets
Note that data pre-processing, diversity estimates and generation of neutral TCR clonal dynamics are not yet available when NoisET is installed with pip. For these features, use only the sudo command below. To install NoisET and try the tutorial displayed in this GitHub repository, clone the repository (git clone) into your working environment. Using the terminal, go to the NoisET directory and run the following command:
$ sudo python setup.py install
If you do not have the Python libraries that NoisET relies on (numpy, pandas, matplotlib, seaborn, scipy, scikit-learn), first try to install the dependencies separately with the following commands:
python -m pip install -U pip
python -m pip install -U matplotlib
pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install -U scikit-learn
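Before running the commands above, it can help to see which of the listed libraries are actually missing. The following is a minimal sketch, not part of NoisET: the helper name `missing_dependencies` and the `REQUIRED` mapping are hypothetical, and the only facts used are the library list above and the fact that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# pip name -> import name (hypothetical helper table; the names differ for scikit-learn)
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
}


def missing_dependencies(required=REQUIRED):
    """Return the pip names of the libraries that cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


# Prints an empty list when every dependency is importable.
print(missing_dependencies())
```

Any name printed by this check can then be installed with the matching `pip install` line.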
Documentation
Command lines with terminal
A tutorial is available at https://github.com/mbensouda/NoisET_tutorial . Three commands are available:
noiset-noise: To infer the null noise model: NoisET first function (1)
noiset-nullgenerator: To qualitatively check the consistency of NoisET first function
noiset-detection: To detect clones responding to a stimulus: NoisET second function (2)
All options are described by typing one of the previous commands followed by --help or -h. Options are also described in the following README.
1/ Infer noise model
To infer the null noise model (NoisET first function (1)), use the command noiset-noise.
At the command prompt, type:
$ noiset-noise --path 'DATA_REPO/' --f1 'FILENAME1_X_REP1' --f2 'FILENAME2_X_REP2' --(noisemodel)
Several options are needed to learn the noise model from two replicate samples associated with one individual at a specific time point:
1/ Data information:
--path 'PATHTODATA': set path to data file
--f1 'FILENAME1_X_REP1': filename for individual X replicate 1
--f2 'FILENAME2_X_REP2': filename for individual X replicate 2
If the columns describing your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal CDR3 nucleotide sequences and clonal CDR3 amino acid sequences) have names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names directly by using:
--specify
--freq 'frequency': column label associated with clonal fraction
--counts 'counts': column label associated with clonal count
--ntCDR3 'ntCDR3': column label associated with clonal CDR3 nucleotide sequence
--AACDR3 'AACDR3': column label associated with clonal CDR3 amino acid sequence
2/ Choice of noise model: (the meaning of the parameters is described in the Methods section)
--NBPoisson: Negative Binomial + Poisson noise model - 5 parameters
--NB: Negative Binomial - 4 parameters
--Poisson: Poisson - 2 parameters
3/ Example:
At the command prompt, type:
$ noiset-noise --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_0_F2_.txt.gz' --NB
This command line learns the four parameters of the negative binomial null noise model --NB for individual Q1 at day 0.
A '.txt' file is created in the working directory: it contains 5, 4 or 2 parameters depending on the noise model (NBP, NB or Poisson). In this example, it is a four-parameter table (already created in the data_examples repository).
You can run the previous example using the data (Q1 day 0 / day 15) provided in the data_examples folder. The data come from Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins, Pogorelyy et al, PNAS (https://www.pnas.org/content/115/50/12704).
4/ Example with --specify:
At the command prompt, type:
$ noiset-noise --path 'data_examples/' --f1 'replicate_1_1.tsv.gz' --f2 'replicate_1_2.tsv.gz' --specify --freq 'frequencyCount' --counts 'count' --ntCDR3 'nucleotide' --AACDR3 'aminoAcid' --NB
As in the previous example, this command learns the four parameters of the negative binomial null noise model --NB, here for one individual of the cohort produced in Model to improve specificity for identification of clinically-relevant expanded T cells in peripheral blood, Rytlewski et al, PLOS ONE (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213684).
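To make the role of the --specify labels concrete, here is a minimal pandas sketch of the mapping from custom column names to the four default labels listed above. It only illustrates the idea and is not how NoisET handles the option internally; the helper name `to_default_labels` and the toy table are hypothetical.

```python
import pandas as pd

# Default column labels quoted in the option description above.
DEFAULT_LABELS = ("Clone fraction", "Clone count", "N. Seq. CDR3", "AA. Seq. CDR3")


def to_default_labels(table, freq, counts, nt_cdr3, aa_cdr3):
    """Rename user-specific column labels to the default labels (hypothetical helper)."""
    mapping = dict(zip((freq, counts, nt_cdr3, aa_cdr3), DEFAULT_LABELS))
    missing = [label for label in mapping if label not in table.columns]
    if missing:
        raise KeyError(f"columns not found in table: {missing}")
    return table.rename(columns=mapping)


# Toy table using the labels passed with --specify in the example above.
toy = pd.DataFrame(
    {
        "frequencyCount": [0.75, 0.25],
        "count": [3, 1],
        "nucleotide": ["TGTGCC", "TGTAGC"],
        "aminoAcid": ["CA", "CS"],
    }
)
renamed = to_default_labels(toy, "frequencyCount", "count", "nucleotide", "aminoAcid")
print(list(renamed.columns))
```

Checking for missing labels first gives a clear error when a label passed on the command line does not match the file header.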
2/ Generate synthetic data from null model learning:
To qualitatively check the consistency of NoisET first function (1) with experiments, or for other reasons, it can be useful to generate synthetic replicates from the null model (described in the Methods section).
One can also generate the dynamics of healthy RepSeq samples, using the noise model learned in a first step and giving the time scale of the repertoire turnover dynamics as defined in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. Check here (https://github.com/statbiophys/NoisET/blob/master/NoisET%20example%20-%20Null%20model%20learning%20.ipynb).
To generate synthetic TCR RepSeq replicates with chosen sampling noise characteristics, use the command noiset-nullgenerator
$ noiset-nullgenerator --(noise-model) --nullpara 'NULLPARAS' --NreadsI float --NreadsII float --Nclones float --output 'SYNTHETICDATA'
1/ Choice of noise model:
The user must choose one of the three possible models for P(n|f), the probability that a TCR has an empirical count n knowing that its true frequency is f: a Poisson distribution --Poisson, a negative binomial distribution --NB, or a two-step model combining a negative binomial and a Poisson distribution --NBP. n is the empirical clone size and depends on the experimental protocol.
For each P(n|f), a set of parameters is learned.
--NBPoisson: Negative Binomial + Poisson noise model - 5 parameters, described in Puelma Touzel et al, 2020: the power-law exponent of the clonotype frequency distribution 'alph_rho', the minimum of the clonotype frequency distribution 'fmin', 'beta' and 'alpha', parameters of the negative binomial distribution constraining the mean and variance of the P(m|f) distribution (m being the number of cells associated with a clonotype in the experimental sample), and 'm_total', the total number of cells in the sample of interest.
--NB: Negative Binomial - 4 parameters: the power-law exponent of the clonotype frequency distribution 'alph_rho' (same ansatz as in Puelma Touzel et al, 2020), the minimum of the clonotype frequency distribution 'fmin', 'beta' and 'alpha', parameters of the negative binomial distribution constraining the mean and variance of the P(n|f) distribution: NB(f·Nreads, f·Nreads + beta·(f·Nreads)^alpha), i.e. mean f·Nreads and variance f·Nreads + beta·(f·Nreads)^alpha. (Nreads is the total number of reads in the sample of interest.)
--Poisson: Poisson - 2 parameters: the power-law exponent of the clonotype frequency distribution 'alph_rho' (same ansatz as in Puelma Touzel et al, 2020) and the minimum of the clonotype frequency distribution 'fmin'. P(n|f) is a Poisson distribution of parameter f·Nreads. (Nreads is the total number of reads in the sample of interest.)
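The --NB and --Poisson models can be illustrated by drawing read counts for a few clone frequencies. This is a sketch under stated assumptions, not the NoisET implementation: it reads NB(mean, variance) as mean f·Nreads and variance mean + beta·mean^alpha, converts that pair to the (n, p) parameters used by `scipy.stats.nbinom`, and uses arbitrary illustrative values for `beta` and `alpha`. The function names are hypothetical.

```python
import numpy as np
from scipy.stats import nbinom, poisson


def nb_params_from_mean_var(mean, var):
    """Convert (mean, variance) to scipy's nbinom parameters (n, p); needs var > mean > 0."""
    p = mean / var
    n = mean ** 2 / (var - mean)
    return n, p


def sample_counts(f, n_reads, model="NB", beta=1.0, alpha=1.1, seed=0):
    """Draw one read count per clone frequency in f (illustrative parameter values)."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(f, dtype=float) * n_reads
    if model == "Poisson":
        # Poisson model: P(n|f) has parameter f * Nreads.
        return poisson.rvs(mean, random_state=rng)
    # NB model (assumed reading): variance = mean + beta * mean**alpha.
    var = mean + beta * mean ** alpha
    n, p = nb_params_from_mean_var(mean, var)
    return nbinom.rvs(n, p, random_state=rng)


frequencies = [1e-5, 1e-4, 1e-3]
print(sample_counts(frequencies, 1_000_000, model="Poisson"))
print(sample_counts(frequencies, 1_000_000, model="NB"))
```

With the same mean, the negative binomial draws spread more widely than the Poisson ones, which is the extra variability the `beta` and `alpha` parameters are meant to capture.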
2/ Specify learned noise parameters:
--nullpara 'PATHTOFOLDER/NULLPARAS.txt': parameters learned with NoisET function (1)
!!! Make sure that the noise model matches the content of the null parameter file: 5 parameters for --NBP, 4 parameters for --NB and 2 parameters for --Poisson.
3/ Sequencing properties of data:
--NreadsI NNNN: total number of reads in the first replicate - it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F1_.txt.gz'.
--NreadsII NNNN: total number of reads in the second replicate - it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F2_.txt.gz'.
--Nclones NNNN: total number of clones in the union of the two replicates - it should match the actual data. In the example below, it is the number of distinct clones found across the two replicates 'Q1_0_F1_.txt.gz' and 'Q1_0_F2_.txt.gz'.
4/ Output file
--output 'SYNTHETICDATA': name of the output file where you can find the synthetic data set.
At the command prompt, type
$ noiset-nullgenerator --NB --nullpara 'data_examples/nullpara1.txt' --NreadsI 829578 --NreadsII 954389 --Nclones 776247 --output 'test'
Running this line creates a 'synthetic_test.csv' file with four columns: 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1', 'Clone_fraction_2', respectively the synthetic read counts and frequencies that you would have found in an experimental sample with the same learned parameters 'nullpara1.txt', 'NreadsI', 'NreadsII' and 'Nclones'.
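The values passed to --NreadsI, --NreadsII and --Nclones are described above as sums of 'Clone count' and as the number of clones in the union of the two replicates. A minimal pandas sketch of that bookkeeping follows. It assumes that a clone is identified by its nucleotide CDR3 sequence ('N. Seq. CDR3'), which the text does not state explicitly, and the helper name `sequencing_properties` and the toy tables are hypothetical.

```python
import pandas as pd


def sequencing_properties(rep1, rep2, count_col="Clone count", seq_col="N. Seq. CDR3"):
    """Return (reads in replicate 1, reads in replicate 2, clones in the union)."""
    nreads_1 = int(rep1[count_col].sum())
    nreads_2 = int(rep2[count_col].sum())
    # Assumption: one clone = one distinct nucleotide CDR3 sequence.
    nclones = len(set(rep1[seq_col]) | set(rep2[seq_col]))
    return nreads_1, nreads_2, nclones


# Toy replicates standing in for the two data files.
rep1 = pd.DataFrame({"N. Seq. CDR3": ["AAA", "CCC", "GGG"], "Clone count": [5, 3, 1]})
rep2 = pd.DataFrame({"N. Seq. CDR3": ["AAA", "TTT"], "Clone count": [4, 2]})
print(sequencing_properties(rep1, rep2))  # (9, 6, 4)
```

For real data, the two tables would first be loaded from the replicate files, for example with `pandas.read_csv`, using whatever separator the files actually use.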
3/ Detect responding clones:
Detect clones responding to a stimulus: NoisET second function (2)
To detect responding clones from two RepSeq datasets taken at time_1 and time_2, use the command noiset-detection
$ noiset-detection --(noisemodel) --nullpara1 'FILEFORPARAS1' --nullpara2 'FILEFORPARAS2' --path 'REPO/' --f1 'FILENAME_TIME_1' --f2 'FILENAME_TIME_2' --pval float --smedthresh float --output 'DETECTIONDATA'
Several options are needed to detect responding clones from two samples associated with one individual at two time points:
1/ Choice of noise model:
--NBPoisson: Negative Binomial + Poisson noise model - 5 parameters
--NB: Negative Binomial - 4 parameters
--Poisson: Poisson - 2 parameters
2/ Specify learned parameters for both time points:
(they can be the same for both time points if replicates are not available, but this should be used carefully, as mentioned in [ARTICLE])
--nullpara1 'PATH/FOLDER/NULLPARAS1.txt': parameters learned with NoisET function (1) for time 1
--nullpara2 'PATH/FOLDER/NULLPARAS2.txt': parameters learned with NoisET function (1) for time 2
!!! Make sure that the noise model matches the content of the null parameter files: 5 parameters for --NBP, 4 parameters for --NB and 2 parameters for --Poisson.
3/ Data information:
--path 'PATHTODATA': set path to data file
--f1 'FILENAME1_X_time1': filename for individual X time 1
--f2 'FILENAME2_X_time2': filename for individual X time 2
If the columns describing your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal CDR3 nucleotide sequences and clonal CDR3 amino acid sequences) have names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names by using:
--specify
--freq 'frequency': column label associated with clonal fraction
--counts 'counts': column label associated with clonal count
--ntCDR3 'ntCDR3': column label associated with clonal CDR3 nucleotide sequence
--AACDR3 'AACDR3': column label associated with clonal CDR3 amino acid sequence
4/ Detection thresholds: (more details in the Methods section)
--pval XXX: p-value threshold for the expansion/contraction - use 0.05 as a default value.
--smedthresh XXX: log fold change median threshold for the expansion/contraction - use 0 as a default value.
5/ Output file
--output 'DETECTIONDATA': name of the output file (.csv) where you can find a list of the putative responding clones with their statistical features. (More details in the Methods section.)
At the command prompt, type
$ noiset-detection --NB --nullpara1 'data_examples/nullpara1.txt' --nullpara2 'data_examples/nullpara1.txt' --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_15_F1_.txt.gz' --pval 0.05 --smedthresh 0 --output 'detection'
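Output: a table containing all putative detected clones with statistical features of the log-fold-change variable s. For a more theoretical description, see Puelma Touzel et al, 2020 (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007873&rev=2).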
Python package
"""
# NoisET<sup>*</sup> NOIse sampling learning & Expansion detection of T-cell receptors using Bayesian inference.
High-throughput sequencing of T- and B-cell receptors makes it possible to track immune
repertoires across time, in different tissues, in acute and chronic diseases or in healthy individuals. However
quantitative comparison between repertoires is confounded by variability in the read count of each receptor
clonotype due to sampling, library preparation, and expression noise. We present an easy-to-use python
package NoisET that implements and generalizes a previously developed Bayesian method in [Puelma Touzel et al, 2020](<https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007873&rev=2>). It can be used
to learn experimental noise models for repertoire sequencing from replicates, and to detect responding
clones following a stimulus. The package was tested on different repertoire sequencing technologies and
datasets. The NoisET package is described [here](<https://arxiv.org/abs/2102.03568>).
<sup>* NoisET should be pronounced as "noisettes" (i.e. hazelnuts in French).</sup>
Functions library for NoisET - construction of noisettes package
Copyright (C) 2021 Meriem Bensouda Koraichi.
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation, either version 3 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License
 along with this program. If not, see <https://www.gnu.org/licenses/>.
# Installation
Python 3
NoisET is Python 3.6 software.
It is available on PyPI and can be installed through pip:
```console
$ pip install noisets
```
Note that data pre-processing, diversity estimates and generation of neutral TCR clonal dynamics are not yet available when NoisET is installed with pip. For these features, use only the sudo command below.
To install NoisET and try the tutorial displayed in this GitHub repository, clone the repository (git clone) into your working environment.
Using the terminal, go to the NoisET directory and run the following command:
```console
$ sudo python setup.py install
```
If you do not have the Python libraries that NoisET relies on (numpy, pandas, matplotlib, seaborn, scipy, scikit-learn), first try to install the dependencies separately with the following commands:
 ```
python -m pip install -U pip
python -m pip install -U matplotlib
pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install -U scikit-learn
 ```
# Documentation
## Command lines with terminal
A tutorial is available at https://github.com/mbensouda/NoisET_tutorial .
Three commands are available:
- `noiset-noise` To infer the null noise model: NoisET first function (1)
- `noiset-nullgenerator` To qualitatively check the consistency of NoisET first function
- `noiset-detection` To detect clones responding to a stimulus: NoisET second function (2)
All options are described by typing one of the previous commands followed by `--help` or `-h`. Options are also described in the following README.
## 1/ Infer noise model
To infer the null noise model (NoisET first function (1)), use the command `noiset-noise`
At the command prompt, type:
```console
$ noiset-noise --path 'DATA_REPO/' --f1 'FILENAME1_X_REP1' --f2 'FILENAME2_X_REP2' --(noisemodel)
```
Several options are needed to learn the noise model from two replicate samples associated with one individual at a specific time point:
#### 1/ Data information:
- `--path 'PATHTODATA'`: set path to data file
- `--f1 'FILENAME1_X_REP1'`: filename for individual X replicate 1
- `--f2 'FILENAME2_X_REP2'`: filename for individual X replicate 2
If the columns describing your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal CDR3 nucleotide sequences and clonal CDR3 amino acid sequences) have names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names directly by using:
- `--specify`
- `--freq 'frequency'`: column label associated with clonal fraction
- `--counts 'counts'`: column label associated with clonal count
- `--ntCDR3 'ntCDR3'`: column label associated with clonal CDR3 nucleotide sequence
- `--AACDR3 'AACDR3'`: column label associated with clonal CDR3 amino acid sequence
#### 2/ Choice of noise model: (the meaning of the parameters is described in the Methods section)
- `--NBPoisson`: Negative Binomial + Poisson noise model - 5 parameters
- `--NB`: Negative Binomial - 4 parameters
- `--Poisson`: Poisson - 2 parameters
#### 3/ Example:
At the command prompt, type:
```console
$ noiset-noise --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_0_F2_.txt.gz' --NB
```
This command line learns the four parameters of the negative binomial null noise model `--NB` for individual Q1 at day 0.
A '.txt' file is created in the working directory: it contains 5, 4 or 2 parameters depending on the noise model (NBP, NB or Poisson). In this example, it is a four-parameter table (already created in the data_examples repository).
You can run the previous example using the data (Q1 day 0 / day 15) provided in the data_examples folder. The data come from [Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins, Pogorelyy et al, PNAS](https://www.pnas.org/content/115/50/12704)
#### 4/ Example with `--specify`:
At the command prompt, type:
```console
$ noiset-noise --path 'data_examples/' --f1 'replicate_1_1.tsv.gz' --f2 'replicate_1_2.tsv.gz' --specify --freq 'frequencyCount' --counts 'count' --ntCDR3 'nucleotide' --AACDR3 'aminoAcid' --NB
```
As in the previous example, this command learns the four parameters of the negative binomial null noise model `--NB`, here for one individual of the cohort produced in [Model to improve specificity for identification of clinically-relevant expanded T cells in peripheral blood, Rytlewski et al, PLOS ONE](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213684).
## 2/ Generate synthetic data from null model learning:
To qualitatively check the consistency of NoisET first function (1) with experiments, or for other reasons, it can be useful to generate synthetic replicates from the null model (described in the Methods section).
One can also generate the dynamics of healthy RepSeq samples, using the noise model learned in a first step and giving the time scale of the repertoire turnover dynamics as defined in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. Check [here](<https://github.com/statbiophys/NoisET/blob/master/NoisET%20example%20-%20Null%20model%20learning%20.ipynb>).
To generate synthetic TCR RepSeq replicates with chosen sampling noise characteristics, use the command `noiset-nullgenerator`
 ```console
 $ noiset-nullgenerator --(noise-model) --nullpara 'NULLPARAS' --NreadsI float --NreadsII float --Nclones float --output 'SYNTHETICDATA'
 ```
#### 1/ Choice of noise model:
The user must choose one of the three possible models for the probability that a TCR has <strong> an empirical count n </strong> knowing that its <strong> true frequency is f </strong>, P(n|f): a Poisson distribution `--Poisson`, a negative binomial distribution `--NB`, or a two-step model combining a negative binomial and a Poisson distribution `--NBP`. n is the empirical clone size and depends on the experimental protocol.
For each P(n|f), a set of parameters is learned.
- `--NBPoisson`: Negative Binomial + Poisson noise model - 5 parameters, described in [Puelma Touzel et al, 2020](<https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007873&rev=2>): the power-law exponent of the clonotype frequency distribution `'alph_rho'`, the minimum of the clonotype frequency distribution `'fmin'`, `'beta'` and `'alpha'`, parameters of the negative binomial distribution constraining the mean and variance of the P(m|f) distribution (m being the number of cells associated with a clonotype in the experimental sample), and `'m_total'`, the total number of cells in the sample of interest.
- `--NB`: Negative Binomial - 4 parameters: the power-law exponent of the clonotype frequency distribution `'alph_rho'` (same ansatz as in [Puelma Touzel et al, 2020](<https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007873&rev=2>)), the minimum of the clonotype frequency distribution `'fmin'`, `'beta'` and `'alpha'`, parameters of the negative binomial distribution constraining the mean and variance of the P(n|f) distribution: <em> NB(f Nreads, f Nreads + beta (f Nreads)<sup>alpha</sup>) </em>. (Nreads is the total number of reads in the sample of interest.)
- `--Poisson`: Poisson - 2 parameters: the power-law exponent of the clonotype frequency distribution `'alph_rho'` (same ansatz as in [Puelma Touzel et al, 2020](<https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007873&rev=2>)) and the minimum of the clonotype frequency distribution `'fmin'`. P(n|f) is a Poisson distribution of parameter <em> f Nreads </em>. (Nreads is the total number of reads in the sample of interest.)
#### 2/ Specify learned noise parameters:
- `--nullpara 'PATHTOFOLDER/NULLPARAS.txt'`: parameters learned with NoisET function (1) \
!!! Make sure that the noise model matches the content of the null parameter file: 5 parameters for `--NBP`, 4 parameters for `--NB` and 2 parameters
for `--Poisson`.
#### 3/ Sequencing properties of data:
- `--NreadsI NNNN`: total number of reads in the first replicate - it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F1_.txt.gz'.
- `--NreadsII NNNN`: total number of reads in the second replicate - it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F2_.txt.gz'.
- `--Nclones NNNN`: total number of clones in the union of the two replicates - it should match the actual data. In the example below, it is the number of distinct clones found across the two replicates 'Q1_0_F1_.txt.gz' and 'Q1_0_F2_.txt.gz'.
#### 4/ Output file
`--output 'SYNTHETICDATA'`: name of the output file where you can find the synthetic data set.
At the command prompt, type
 ```console
 $ noiset-nullgenerator --NB --nullpara 'data_examples/nullpara1.txt' --NreadsI 829578 --NreadsII 954389 --Nclones 776247 --output 'test'
 ```
 Running this line creates a 'synthetic_test.csv' file with four columns: 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1', 'Clone_fraction_2', respectively the synthetic read counts and frequencies that you would have found in an experimental sample with the same learned parameters 'nullpara1.txt', 'NreadsI', 'NreadsII' and 'Nclones'.
## 3/ Detect responding clones:

Detect clones responding to a stimulus: NoisET second function (2)
To detect responding clones from two RepSeq datasets taken at time_1 and time_2, use the command `noiset-detection`
```console
$ noiset-detection --(noisemodel) --nullpara1 'FILEFORPARAS1' --nullpara2 'FILEFORPARAS2' --path 'REPO/' --f1 'FILENAME_TIME_1' --f2 'FILENAME_TIME_2' --pval float --smedthresh float --output 'DETECTIONDATA'
```
Several options are needed to detect responding clones from two samples associated with one individual at two time points:
#### 1/ Choice of noise model:
- `--NBPoisson`: Negative Binomial + Poisson noise model - 5 parameters
- `--NB`: Negative Binomial - 4 parameters
- `--Poisson`: Poisson - 2 parameters
#### 2/ Specify learned parameters for both time points:
(they can be the same for both time points if replicates are not available, but this should be used carefully, as mentioned in [ARTICLE])
- `--nullpara1 'PATH/FOLDER/NULLPARAS1.txt'`: parameters learned with NoisET function (1) for time 1
- `--nullpara2 'PATH/FOLDER/NULLPARAS2.txt'`: parameters learned with NoisET function (1) for time 2
!!! Make sure that the noise model matches the content of the null parameter files: 5 parameters for `--NBP`, 4 parameters for `--NB` and 2 parameters
for `--Poisson`.
#### 3/ Data information:
- `--path 'PATHTODATA'`: set path to data file
- `--f1 'FILENAME1_X_time1'`: filename for individual X time 1
- `--f2 'FILENAME2_X_time2'`: filename for individual X time 2
If the columns describing your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal CDR3 nucleotide sequences and clonal CDR3 amino acid sequences) have names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names by using:
- `--specify`
- `--freq 'frequency'`: column label associated with clonal fraction
- `--counts 'counts'`: column label associated with clonal count
- `--ntCDR3 'ntCDR3'`: column label associated with clonal CDR3 nucleotide sequence
- `--AACDR3 'AACDR3'`: column label associated with clonal CDR3 amino acid sequence
#### 4/ Detection thresholds: (more details in the Methods section)
- `--pval XXX`: p-value threshold for the expansion/contraction - use 0.05 as a default value.
- `--smedthresh XXX`: log fold change median threshold for the expansion/contraction - use 0 as a default value.
#### 5/ Output file
`--output 'DETECTIONDATA'`: name of the output file (.csv) where you can find a list of the putative responding clones with their statistical features. (More details in the Methods section.)
At the command prompt, type
```console
$ noiset-detection --NB --nullpara1 'data_examples/nullpara1.txt' --nullpara2 'data_examples/nullpara1.txt' --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_15_F1_.txt.gz' --pval 0.05 --smedthresh 0 --output 'detection'
```
Output: a table containing all putative detected clones with statistical features of the log-fold-change variable <em> s </em>. For a more theoretical description, see [Puelma Touzel et al, 2020](<https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007873&rev=2>).
## Python package
"""


# Import python libraries
import os
import time
import math
from copy import deepcopy
from decimal import Decimal
from functools import partial

import matplotlib.pyplot as plt
from matplotlib import cm, colors, colorbar
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from scipy.stats import nbinom
from scipy.stats import poisson
from scipy.stats import rv_discrete
from datetime import datetime, date
from scipy.optimize import minimize

#tools for PCA
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

#tools to generate RepSeq traj
import shutil
from multiprocessing import Pool, cpu_count
from functools import partial

###===================================TOOLS-TO-GENERATE-NEUTRAL-TCR-REP-SEQ-TRAJECTORIES=====================================================
# Library functions to generate TCR repertoires
##------------------------Initial-Distributions------------------------
def _rho_counts_theo_minus_x(A, B, N_0):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user

    # I am discretizing the logspace with nfbins = 100000


    Cmin = 1
    freq_dtype = 'float32'

    N_cells = int(1e10)
    S_c = -(A+B/2)*N_cells/(N_0-1)

    alpha = -2*A/B

    nbins_1 = 100000

    logcountvec = np.linspace(np.log10(Cmin),np.log10(N_0), nbins_1)
    log_countvec_minus = np.array(np.log(np.power(10,logcountvec)) ,dtype=freq_dtype).flatten()
    log_rho_minus = np.log(-(S_c/A))+ np.log(1-np.exp(-alpha*log_countvec_minus))

    N_clones_1 = -(S_c/A)*(np.log(N_0) - (1/alpha)*(1 - N_0**(-alpha)))


    return log_rho_minus, log_countvec_minus, N_clones_1

def _rho_counts_theo_plus_x(A, B, N_0):
    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user
    # I am discretizing the logspace with nfbins = 100000, I can put a better discretization than for the minus
    # distribution

    Cmax = int(1e10)
    #Cmax = np.inf
    freq_dtype = 'float32'

    N_cells = int(1e10)
    S_c = -(A+B/2)*N_cells/(N_0 -1)

    alpha = -2*A/B

    nbins_2 = 100000

    logcountvec = np.linspace(np.log10(N_0),np.log10(Cmax), nbins_2 )
    log_countvec_plus = np.array(np.log(np.power(10,logcountvec)) ,dtype=freq_dtype).flatten()
    log_rho_plus = np.log(N_0**alpha-1) + np.log(-(S_c/A)) -(alpha)*log_countvec_plus

    N_clones_2 = -(S_c/(A*alpha))*(1 - N_0**(-alpha))


    return log_rho_plus, log_countvec_plus, N_clones_2


def _get_distsample(pmf,Nsamp, dtype='uint32'):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user
    '''
    generates Nsamp index samples of dtype (e.g. uint16 handles up to 65535 indices) from discrete probability mass function pmf.
    Handles multi-dimensional domain. N.B. Output is sorted.
    '''
    #assert np.sum(pmf)==1, "cmf not normalized!"

    shape = np.shape(pmf)
    sortindex = np.argsort(pmf, axis=None)#uses flattened array
    pmf = pmf.flatten()
    pmf = pmf[sortindex]
    cmf = np.cumsum(pmf)
    #print('cumulative distribution is equal to: ' + str(cmf[-1]))
    choice = np.random.uniform(high = cmf[-1], size = int(float(Nsamp)))
    index = np.searchsorted(cmf, choice)
    index = sortindex[index]
    index = np.unravel_index(index, shape)
    index = np.transpose(np.vstack(index))
    sampled_inds = np.array(index[np.argsort(index[:,0])], dtype=dtype)
    return sampled_inds

##------------------------Propagator------------------------
def _gaussian_matrix(x_vec, x_i_vec_unique, A, B, t):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user
    x_vec_reshaped = np.reshape(x_vec, (len(x_vec), 1))
    ones_vec = np.ones((len(x_i_vec_unique), 1))
    M = np.multiply(ones_vec, x_vec_reshaped.T)
    x_i_unique_reshaped = np.reshape(x_i_vec_unique, (len(x_i_vec_unique), 1))

    return (1/np.sqrt(2*np.pi*B*t))*np.exp((-1/(2*B*t))*(M - x_i_unique_reshaped - A*t)**2)

def _gaussian_adsorption_matrix(x_vec, x_i_vec_unique, A, B, t):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user
    a = 0
    gauss = _gaussian_matrix(x_vec, x_i_vec_unique, A, B, t)
    gauss_a = _gaussian_matrix(x_vec, 2*a - x_i_vec_unique, A, B, t)
    x_i_unique_reshaped = np.reshape(x_i_vec_unique, (len(x_i_vec_unique), 1))
    return gauss - np.exp((A*(a - x_i_unique_reshaped))/(B/2)) * gauss_a

def _extinction_vector(x_i, A, B, t):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user
    nbins = 2000
    eps = 1e-20
    #eps = 0
    x_vec = np.linspace(eps, np.max(x_i) - A*t + 3*np.sqrt(B*t), nbins)

    x_i_sorted = np.sort(x_i)

    xiind_vals, xi_start_ind, xi_counts = np.unique(x_i_sorted, return_counts=True, return_index=True)
    Prop_Matrix = _gaussian_adsorption_matrix(x_vec, xiind_vals, A, B, t)

    dx = np.asarray(np.diff(x_vec)/2., dtype='float32')
    integ = np.sum(dx*(Prop_Matrix[:, 1:] + Prop_Matrix[:, :-1]), axis=1)
    p_ext = 1 - integ

    p_ext_new = np.zeros((len(x_i)))
    for it, xiind in enumerate(xiind_vals):
        p_ext_new[xi_start_ind[it]:xi_start_ind[it] + xi_counts[it]] = p_ext[it]

    test = np.random.uniform(0, 1, size=(len(p_ext_new))) > p_ext_new
    results_extinction = test.astype(int)

    return results_extinction, Prop_Matrix, x_vec, xiind_vals, xi_start_ind, xi_counts, p_ext

#------------------------Source-term-no-frequency-dependency------------------------

def _gaussian_matrix_time(x_vec, x_i_scal, A, B, tvec_unique):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user

    x_vec_reshaped = np.reshape(x_vec, (len(x_vec), 1))
    ones_vec = np.ones((len(tvec_unique), 1))
    M = np.multiply(ones_vec, x_vec_reshaped.T)
    tvec_unique_reshaped = np.reshape(tvec_unique, (len(tvec_unique), 1))
    #x_i_unique_reshaped = np.reshape(x_i_vec_unique, (len(x_i_vec_unique), 1))

    return (1/np.sqrt(2*np.pi*B*tvec_unique_reshaped))*np.exp((-1/(2*B*tvec_unique_reshaped))*(M - x_i_scal - A*tvec_unique_reshaped)**2)

def _gaussian_adsorption_matrix_time(x_vec, x_i_scal, A, B, tvec_unique):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user

    a = 0
    gauss = _gaussian_matrix_time(x_vec, x_i_scal, A, B, tvec_unique)
    gauss_a = _gaussian_matrix_time(x_vec, 2*a - x_i_scal, A, B, tvec_unique)

    return gauss - np.exp((A*(a - x_i_scal))/(B/2)) * gauss_a

def _Prop_Matrix_source(A, B, tvec):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user

    nbins = 2000
    N_0 = 40
    x_i_scal = np.log(N_0)
    t = np.max(tvec)
    x_vec = np.linspace(0, x_i_scal - A*t + 2*np.sqrt(B*t), nbins)

    tvec_sorted = np.sort(tvec)

    tiind_vals, ti_start_ind, ti_counts = np.unique(tvec_sorted, return_counts=True, return_index=True)
    Prop_Matrix = _gaussian_adsorption_matrix_time(x_vec, x_i_scal, A, B, tiind_vals)

    dx = np.asarray(np.diff(x_vec)/2., dtype='float32')
    integ = np.sum(dx*(Prop_Matrix[:, 1:] + Prop_Matrix[:, :-1]), axis=1)

    return Prop_Matrix, x_vec, tiind_vals, ti_start_ind, ti_counts, integ

def _extinction_vector_source(A, B, tvec):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user

    nbins = 2000
    N_0 = 40
    x_i_scal = np.log(N_0)
    t = np.max(tvec)
    x_vec = np.linspace(0, x_i_scal - A*t + 2*np.sqrt(B*t), nbins)

    tvec_sorted = np.sort(tvec)

    tiind_vals, ti_start_ind, ti_counts = np.unique(tvec_sorted, return_counts=True, return_index=True)
    Prop_Matrix = _gaussian_adsorption_matrix_time(x_vec, x_i_scal, A, B, tiind_vals)

    dx = np.asarray(np.diff(x_vec)/2., dtype='float32')
    integ = np.sum(dx*(Prop_Matrix[:, 1:] + Prop_Matrix[:, :-1]), axis=1)
    p_ext = 1 - integ

    p_ext_new = np.zeros((len(tvec)))
    for it, tiind in enumerate(tiind_vals):
        p_ext_new[ti_start_ind[it]:ti_start_ind[it] + ti_counts[it]] = p_ext[it]

    test = np.random.uniform(0, 1, size=(len(p_ext_new))) > p_ext_new
    results_extinction = test.astype(int)

    return results_extinction, Prop_Matrix, x_vec, tiind_vals, ti_start_ind, ti_counts, p_ext

##------------------------Function-to-generate-in-silico-Rep-Seq-samples------------------------

def _generator_diffusion_LB(A, B, N_0,
                            t):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user

    eps = 1e-20

    ## Choose initial size of the immune system to be 1e10 (for a mouse)
    N_cells = int(1e10)

    #Parameters for the repertoire generation
    alpha_rho = -1 + 2*A/B
    N_ext = 1
    freq_dtype = 'float32'

    #==========================generate the steady state distribution===============================

    #for counts < N0:
    logrhofvec, logfvec, N_clones_1 = _rho_counts_theo_minus_x(A, B, N_0)
    dlogfby2 = np.asarray(np.diff(logfvec)/2., dtype='float32')
    integ = np.exp(logrhofvec[np.newaxis, :])
    f_samples_inds = _get_distsample(np.asarray((dlogfby2[np.newaxis, :]*(integ[:, 1:] + integ[:, :-1])).flatten(), dtype='float32'), N_clones_1, dtype='uint32').flatten()
    #print("generation population smaller than N_0: check")

    logcvec_generated = logfvec[f_samples_inds]
    counts_generated = np.exp(logcvec_generated)
    C_f_minus = np.sum(counts_generated)
    print(str(C_f_minus) + ' cells smaller than N_0')
    log_cminus_generated = logcvec_generated
    logrhofvec_1, logfvec_1 = logrhofvec, logfvec
    print(str(N_clones_1) + ' N_clones_1')

    #for counts > N0:

    logrhofvec, logfvec, N_clones_2 = _rho_counts_theo_plus_x(A, B, N_0)
    dlogfby2 = np.asarray(np.diff(logfvec)/2., dtype='float32')
    integ = np.exp(logrhofvec[np.newaxis, :])
    f_samples_inds = _get_distsample(np.asarray((dlogfby2[np.newaxis, :]*(integ[:, 1:] + integ[:, :-1])).flatten(), dtype='float32'), N_clones_2, dtype='uint32').flatten()
    #print("generation population larger than N_0: check")
    logcvec_generated = logfvec[f_samples_inds]
    counts_generated = np.exp(logcvec_generated)
    C_f_plus = np.sum(counts_generated)
    print(str(C_f_plus) + ' N_cells larger than N_0')
    log_cplus_generated = logcvec_generated
    logrhofvec_2, logfvec_2 = logrhofvec, logfvec
    print(str(N_clones_2) + ' N_clones_2')

    #===================================================

    N_clones = int(N_clones_1 + N_clones_2)
    print('N_clones= ' + str(N_clones))

    S_c = -(A + B/2)*(N_cells/(N_0 - 1))
    print('N_clones_theory= ' + str(-(S_c/A)*np.log(N_0)))

    x_i = np.concatenate((log_cminus_generated, log_cplus_generated), axis=None)

    N_total_cells_generated = np.sum(np.exp(x_i))
    print("N_total_cells_generated/N_total_cells:" + str(N_total_cells_generated/N_cells))

    results_extinction, Prop_Matrix, x_vec, xiind_vals, xi_start_ind, xi_counts, p_ext = _extinction_vector(x_i, A, B, t)
    #x_vec = np.linspace(0, 30*B*t, 2000)
    dx = np.asarray(np.diff(x_vec)/2., dtype='float32')

    x_i_noext = x_i[np.where(results_extinction == 1)]
    x_f = np.zeros((len(x_i)))

    for i in range(len(xiind_vals)):

        # Trapezoidal integral of the propagator row; rows with negligible mass are skipped
        norm_i = np.dot(dx, Prop_Matrix[i, 1:] + Prop_Matrix[i, :-1])
        if norm_i < 1e-7:
            pass

        else:

            Prop_adsorp = Prop_Matrix[i, :] / norm_i

            integ = Prop_adsorp[np.newaxis, :]
            f_samples_inds = _get_distsample(np.asarray((dx[np.newaxis, :]*(integ[:, 1:] + integ[:, :-1])).flatten(), dtype='float32'), xi_counts[i], dtype='uint32').flatten()

            x_f[xi_start_ind[i]:xi_start_ind[i] + xi_counts[i]] = x_vec[f_samples_inds]

    x_f = np.multiply(x_f, results_extinction)

    # Entries left at 0 are marked as extinct clones (log-size of -inf)
    x_f[x_f == 0] = -np.inf

    N_extinction = len(x_f[x_f == -np.inf])

    print('Number of extinction= ' + str(N_extinction))
    sim_ext = (N_extinction/len(results_extinction))*100
    theo_ext = (-A/np.log(N_0))*100
    print('simulations % of extinction= ' + str(sim_ext/t) + '%')
    print('theoretical % of extinction= ' + str(theo_ext) + '%')

    #Source term
    N_source = S_c*t

    print('Number of insertions= ' + str(N_source))

    N_source = int(N_source)

    eps = 1e-8
    time_vec_span = np.linspace(eps, t, 5000)
    time_vec = np.random.choice(time_vec_span, N_source)
    time_vec = np.sort(time_vec)

    results_extinction_source, Prop_Matrix_source, x_vec_source, tiind_vals, ti_start_ind, ti_counts, p_ext_source = _extinction_vector_source(A, B, time_vec)

    dx_source = np.asarray(np.diff(x_vec_source)/2., dtype='float32')

    x_source_LB = np.zeros((N_source))
    for i in range(len(tiind_vals)):

        # Trapezoidal integral of the propagator row; rows with negligible mass are skipped
        norm_i = np.dot(dx_source, Prop_Matrix_source[i, 1:] + Prop_Matrix_source[i, :-1])
        if norm_i < 1e-7:
            pass

        else:
            Prop_adsorp_s = Prop_Matrix_source[i, :] / norm_i

            integ = Prop_adsorp_s[np.newaxis, :]
            f_samples_inds_s = _get_distsample(np.asarray((dx_source[np.newaxis, :]*(integ[:, 1:] + integ[:, :-1])).flatten(), dtype='float32'), ti_counts[i], dtype='uint32').flatten()

            x_source_LB[ti_start_ind[i]:ti_start_ind[i] + ti_counts[i]] = x_vec_source[f_samples_inds_s]

    x_source_LB = np.multiply(x_source_LB, results_extinction_source)

    x_source_LB[x_source_LB == 0] = -np.inf

    return x_i, x_f, Prop_Matrix, p_ext, results_extinction, time_vec, results_extinction_source, x_source_LB

def _experimental_sampling_diffusion_Poisson(NreadsI, NreadsII, x_0, x_2, t, N_cell_0, N_cell_2):

    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # this function is not made for a NoisET user

    #----------------------------Counts generation --------------------------------------------

    ##Initial condition
    N_total_0 = len(x_0[x_0 != -np.inf])
    x_0_bis = x_0[x_0 != -np.inf]

    print('Number of clones at initial time ' + str(N_total_0))

    N_total_2 = len(x_2[x_2 != -np.inf])
    x_2_bis = x_2[x_2 != -np.inf]

    print('Number of clones after ' + str(t) + ' year(s) ' + str(N_total_2))

    #N_total = min(N_total_0, N_total_2)
    assert len(x_0) == len(x_2)
    N_total = len(x_0)

    x_2_final = x_2[:N_total]

    f_vec_initial = np.exp(x_0)/N_cell_0
    m = float(NreadsI)*f_vec_initial
    n_counts_day_0 = np.random.poisson(m, size=(1, int(N_total)))
    n_counts_day_0 = n_counts_day_0[0, :]

    #print('done')

    #Final condition
    f_vec_end = np.exp(x_2_final)/N_cell_2
    m = float(NreadsII)*f_vec_end
    #print(m)
    print('MEAN N : ' + str(np.mean(m)))
    n_counts_day_1 = np.random.poisson(m, size=(1, int(N_total)))
    print(n_counts_day_1)
    n_counts_day_1 = n_counts_day_1[0, :]

    #-------------------------------Creation of the data set-------------------------------------

    # Keep only the clones observed in at least one of the two samples
    obs = np.logical_or(n_counts_day_0 > 0, n_counts_day_1 > 0)
    n1_samples = n_counts_day_0[obs]
    n2_samples = n_counts_day_1[obs]
    pair_samples_df = pd.DataFrame({'Clone_count_1': n1_samples, 'Clone_count_2': n2_samples})

    pair_samples_df['Clone_frequency_1'] = pair_samples_df['Clone_count_1'] / np.sum(pair_samples_df['Clone_count_1'])
    pair_samples_df['Clone_frequency_2'] = pair_samples_df['Clone_count_2'] / np.sum(pair_samples_df['Clone_count_2'])

    return pair_samples_df

def _experimental_sampling_diffusion_NegBin(NreadsI, NreadsII, paras, x_0, x_2, N_cell_0, N_cell_2):

    #----------------------------Counts generation --------------------------------------------

    ##Initial condition
    N_total_0 = len(x_0[x_0 != -np.inf])
    x_0_bis = x_0[x_0 != -np.inf]

    print('Number of clones at initial time ' + str(N_total_0))

    N_total_2 = len(x_2[x_2 != -np.inf])
    x_2_bis = x_2[x_2 != -np.inf]

    print('Number of clones after 2 years ' + str(N_total_2))

    #N_total = min(N_total_0, N_total_2)
    assert len(x_0) == len(x_2)
    N_total = len(x_0)

    f_vec_initial = np.exp(x_0)/N_cell_0
    m = float(NreadsI)*f_vec_initial
    print(m)

    beta_mv = paras[1]
    alpha_mv = paras[2]

    # Mean-variance relation of the negative binomial noise model: v = m + beta*m**alpha
    v = m + beta_mv*np.power(m, alpha_mv)

    pvec = 1 - m/v
    nvec = m*m/v/pvec

    # Extinct clones (m = 0) give 0/0 above; replace the resulting NaNs
    pvec = np.nan_to_num(pvec, nan=0.0)
    nvec = np.nan_to_num(nvec, nan=1e-30)

    print(pvec)
    print(1 - pvec)
    print(np.sum(pvec >= 1))
    print(nvec)

    n_counts_day_0 = np.random.negative_binomial(nvec, 1 - pvec, size=(1, int(N_total)))
    n_counts_day_0 = n_counts_day_0[0, :]
    print(n_counts_day_0)

    #Final condition
    f_vec_end = np.exp(x_2)/N_cell_2
    m_end = float(NreadsII)*f_vec_end
    print(m_end)

    v_end = m_end + beta_mv*np.power(m_end, alpha_mv)
    pvec_end = 1 - m_end/v_end
    nvec_end = m_end*m_end/v_end/pvec_end

    pvec_end = np.nan_to_num(pvec_end, nan=0.0)
    nvec_end = np.nan_to_num(nvec_end, nan=1e-30)

    n_counts_day_1 = np.random.negative_binomial(nvec_end, 1 - pvec_end, size=(1, int(N_total)))
    n_counts_day_1 = n_counts_day_1[0, :]
    print(n_counts_day_1)

    #-------------------------------Creation of the data set-------------------------------------

    # Keep only the clones observed in at least one of the two samples
    obs = np.logical_or(n_counts_day_0 > 0, n_counts_day_1 > 0)
    n1_samples = n_counts_day_0[obs]
    n2_samples = n_counts_day_1[obs]
    pair_samples_df = pd.DataFrame({'Clone_count_1': n1_samples, 'Clone_count_2': n2_samples})

    pair_samples_df['Clone_frequency_1'] = pair_samples_df['Clone_count_1'] / np.sum(pair_samples_df['Clone_count_1'])
    pair_samples_df['Clone_frequency_2'] = pair_samples_df['Clone_count_2'] / np.sum(pair_samples_df['Clone_count_2'])

    return pair_samples_df

#==========================================================================================================================


#===============================Longitudinal-Data-Pre-Processing===================================

class longitudinal_analysis():

    """
    This class provides some tools to inspect and compute some simple statistics on longitudinal data associated with
    one individual (it is independent of the NoisET software).

    ...
    Attributes
    ----------
    clone_count_label : str
        label in the clonotype tables indicating the clonotype count
    seq_label : str
        label in the clonotype tables indicating the sequence of the receptor
    clones : dict of pandas.DataFrame
        dictionary containing the clonotype tables as pandas frames. The keys are
        strings "patient_time"; replicates are merged. Created in the initialization
    times : list of float
        ordered times of the imported tables. Created in the initialization
    unique_clones : list of str
        list of all the unique clonotype sequences in all the time points
    time_occurrence : list of int
        number of time points in which each clonotype appears. The index
        refers to the clonotype in the unique_clones list
    Methods
    -------
    compute_clone_time_occurrence()
        It creates two new attributes: the list of unique clonotypes in all the dataset
        "self.unique_clones" and the time occurrence of each of them "self.time_occurrence".
        The time occurrence is the number of time points in which the clone appears.
    plot_hist_persistence(figsize=(12,10))
        It plots the distribution of time occurrence of the unique clonotypes
    top_clones_set(n_top_clones)
        Compute the set of top clones as the union of the "n_top_clones" most abundant
        clonotypes in each time point
    build_traj_frame(clone_set)
        Build a dataframe containing the frequency at all the time points for each
        of the clonotypes in "clone_set", together with their cumulative frequency
    plot_trajectories(n_top_clones, colormap='viridis', figsize=(12,10))
        Function to plot the trajectories of the first "n_top_clones". Colors of the
        trajectories represent the cumulative frequency in all the time points.
    PCA_traj(n_top_clones, nclus=4)
        Perform PCA over the normalized trajectories of n_top_clones TCR clones.
        The normalization consists in dividing the whole trajectory by its maximum value.
        After PCA the trajectories are clustered in the space of the two principal components
        with a hierarchical clustering algorithm.
    plot_PCA2(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10))
        Plotting the trajectories in the space of their two principal components and
        clustering them as in "PCA_traj".
    plot_PCA_clusters_traj(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10))
        Plotting the trajectories grouped by PCA clusters
    """

    def __init__(self, patient, data_folder, sequence_label='N. Seq. CDR3', clone_count_label='Clone count',
                 replicate1_label='_F1', replicate2_label='_F2', separator='\t'):
        """
        Import all the clonotypes of a given patient and store them in the dictionary "self.clones".
        It also creates the list of times "self.times". During this process the replicates at the
        same time points are merged together.
        The names of the tables containing TCR should be structured as "patient_time_replicate.csv".
        Those tables should be csv files compressed in a zip archive (see the example notebook).
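
Illustration of the naming convention above, as a minimal sketch rather than the package's own code: `parse_table_name` is a hypothetical helper and `P1_15_F1.csv` is a made-up file name; the actual extension of the tables is not assumed here.

```python
def parse_table_name(file_name):
    # Hypothetical helper (not part of NoisET): split "patient_time_replicate.<ext>"
    # into its three fields. Everything after the first dot is treated as extension.
    stem = file_name.split('.')[0]
    patient, time, replicate = stem.split('_')
    return patient, int(time), replicate


print(parse_table_name('P1_15_F1.csv'))
```

A patient ID containing an underscore would break this split, which is also true of any scheme that reads the patient from the text before the first underscore.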
        Parameters
        ----------
        patient : str
            The ID of the patient
        data_folder : str
            folder name containing the csv files listing the T-cell receptors
        sequence_label : str
            label of the column containing the receptor sequence
        clone_count_label : str
            label of the column containing the clonotype count
        replicate1_label : str
            suffix identifying the first replicate in the table names
        replicate2_label : str
            suffix identifying the second replicate in the table names
        separator : str
            separator symbol in the csv tables
        """

        self.clone_count_label = clone_count_label
        self.seq_label = sequence_label
        self.unique_clones = None
        self.time_occurrence = None
        self.times = []
        clones_repl = dict()

        # Iteration over all the files in the folder for importing each table
        for file_name in os.listdir(data_folder):
            # If the name before the underscore corresponds to the chosen patient..
            if file_name.split('_')[0] == patient:
                # Import the table, using the separator given by the caller
                frame = pd.read_csv(os.path.join(data_folder, file_name), sep=separator, compression=dict(method='zip'))
                # Store it in a dictionary where the key contains the patient, the time
                # and the replicate (the last 10 characters of the file name are dropped).
                clones_repl[file_name[:-10]] = frame
                # Reading the time from the name and storing it
                self.times.append(int(file_name.split('_')[1]))
                print('Clonotypes', file_name[:-10], 'imported')

        # Sorting the unique times
        self.times = np.sort(list(set(self.times)))
        self.clones = self._merge_replicates(patient, clones_repl, replicate1_label, replicate2_label)


    def _merge_replicates(self, patient, clones_repl, repl1_label, repl2_label):

        clones_merged = dict()

        # Iteration over the times
        for it, t in enumerate(self.times):
            # Building the ids corresponding to the 1st and 2nd replicate at the given time point
            id_F1 = patient + '_' + str(t) + repl1_label
            id_F2 = patient + '_' + str(t) + repl2_label
            # Below all the rows of one table are appended to the rows of the other
            merged_replicates = clones_repl[id_F1].merge(clones_repl[id_F2], how='outer')
            # But there are common clonotypes that now appear in two different rows
            # (one for the first and one for the second replicate)!
            # Below we collapse those common sequences and the counts of the two are summed
            merged_replicates = merged_replicates.groupby(self.seq_label, as_index=False).agg({self.clone_count_label: 'sum'})
            depth = merged_replicates[self.clone_count_label].sum()
            merged_replicates['Clone freq'] = merged_replicates[self.clone_count_label] / depth
            merged_replicates = merged_replicates.sort_values('Clone freq', ascending=False)
            # The merged table is then added to the dictionary
            clones_merged[patient + '_' + str(t)] = merged_replicates

        return clones_merged


    def compute_clone_time_occurrence(self):

        """
        It creates two new attributes: the list of unique clonotypes in all the dataset
        "self.unique_clones" and the time occurrence of each of them "self.time_occurrence".
        The time occurrence is the number of time points in which the clone appears.
        """

        all_clones = np.array([])
        for id_, cl in self.clones.items():
            all_clones = np.append(all_clones, cl[self.seq_label].values)

        # The following function returns the list of unique clonotypes and the number of
        # repetitions for each of them.
        # Note that the number of repetitions is exactly the time occurrence
        self.unique_clones, self.time_occurrence = np.unique(all_clones, return_counts=True)


    def plot_hist_persistence(self, figsize=(12,10)):

        """
        It plots the distribution of time occurrence of the unique clonotypes
        Parameters
        ----------
        figsize : tuple
            width, height in inches

        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes._subplots.AxesSubplot
            axes where to draw the plot
        """

        if not isinstance(self.unique_clones, np.ndarray):
            self.compute_clone_time_occurrence()

        fig, ax = plt.subplots(figsize=figsize)

        plt.rc('xtick', labelsize=30)
        plt.rc('ytick', labelsize=30)

        ax.set_yscale('log')
        ax.set_xlabel('Time occurrence', fontsize=30)
        ax.set_ylabel('Counts', fontsize=30)
        # One bin centered on each integer number of time points
        ax.hist(self.time_occurrence, bins=np.arange(1, len(self.times) + 2) - 0.5, rwidth=0.6)

        return fig, ax


    def top_clones_set(self, n_top_clones):

        """
        Compute the set of top clones as the union of the "n_top_clones" most abundant
        clonotypes in each time point
        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes in each time point
        Returns
        -------
        top_clones : set of str
            set of top clones
        """

        top_clones = set()
        for id_, cl in self.clones.items():
            top_clones_at_time = cl.sort_values(self.clone_count_label, ascending=False)[:n_top_clones]
            top_clones = top_clones.union(top_clones_at_time[self.seq_label].values)
        return top_clones


    def build_traj_frame(self, clone_set):

        """
        This builds a dataframe containing the frequency at all the time points for each
        of the clonotypes specified in clone_set.
        The dataframe also has a field that contains the cumulative frequency.
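
Illustration of the bookkeeping described above, as a plain-Python sketch on made-up data. It uses dictionaries instead of pandas frames, and the sequences, frequencies and `P1_*` keys are invented for the example.

```python
# Toy tables keyed by "patient_time": clonotype -> frequency (made-up values).
clones = {
    'P1_0': {'AAA': 0.7, 'CCC': 0.3},
    'P1_15': {'AAA': 0.4, 'GGG': 0.6},
}
clone_set = ['AAA', 'CCC']

# One row per clonotype, one 't<time>' entry per time point, plus a running total.
traj = {seq: {'Clone cumul freq': 0.0} for seq in clone_set}
for key, table in clones.items():
    t = key.split('_')[1]
    for seq in clone_set:
        freq = table.get(seq, 0.0)  # absent clonotypes count as frequency 0
        traj[seq]['t' + t] = freq
        traj[seq]['Clone cumul freq'] += freq

print(traj['AAA'])
```

Clonotypes outside `clone_set` (here `GGG`) never enter the result, and a clonotype missing at a time point contributes a frequency of 0 rather than a missing value.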
        Parameters
        ----------
        clone_set : iterable of str
            clonotypes whose temporal trajectory is drawn
        Returns
        -------
        traj_frame : pandas.DataFrame
            dataframe containing the frequency at all the time points
        """

        # Accept any iterable; a set is needed for the intersection below
        clone_set = set(clone_set)
        traj_frame = pd.DataFrame(index=list(clone_set))
        traj_frame['Clone cumul freq'] = 0

        for id_, cl in self.clones.items():

            # Getting the time from the index of clones_merged
            t = id_.split('_')[1]
            # Selecting the clonotypes that are both in the frame at the given time
            # point and in clone_set
            top_clones_at_time = clone_set.intersection(set(cl[self.seq_label]))
            # Creating a sub-dataframe containing only the clones in top_clones_at_time
            # (a list is used because pandas does not accept a set as an indexer)
            clones_at_time = cl.set_index(self.seq_label).loc[list(top_clones_at_time)]
            # Creating a new column in the trajectory frame for the frequencies at that time
            traj_frame['t' + str(t)] = traj_frame.index.map(clones_at_time['Clone freq'].to_dict())
            # The clonotypes not present at that time are NaN. Below we convert NaN to 0
            traj_frame = traj_frame.fillna(0)
            # The cumulative frequency for each clonotype is updated
            traj_frame['Clone cumul freq'] += traj_frame['t' + str(t)]

        return traj_frame


    # Plot clonal trajectories


    def plot_trajectories(self, n_top_clones, colormap='viridis', figsize=(12,10)):

        """
        Function to plot the trajectories of the first "n_top_clones". Colors of the
        trajectories represent the cumulative frequency in all the time points.
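
Illustration of how a cumulative frequency can be turned into a position on a colormap, as a sketch on made-up values. The rescaling of log10 values to the interval [0, 1] mirrors the description above; the dictionary `cumul` is an assumption of this example.

```python
import math

# Made-up cumulative frequencies for three clonotypes.
cumul = {'AAA': 1.1, 'CCC': 0.3, 'GGG': 0.011}

# Work in log10 and rescale linearly so the rarest clone maps to 0 and the largest to 1.
logs = {seq: math.log10(value) for seq, value in cumul.items()}
lo, hi = min(logs.values()), max(logs.values())
position = {seq: (value - lo) / (hi - lo) for seq, value in logs.items()}

print(position)
```

If all cumulative frequencies are equal, `hi - lo` is zero and the rescaling is undefined, so a caller would need to handle that case separately.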

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes in each time point
        colormap : str
            colors of the trajectories

        figsize : tuple
            width, height in inches
        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes._subplots.AxesSubplot
            axes where to draw the plot
        """

        cmap = plt.get_cmap(colormap)
        top_clones = self.top_clones_set(n_top_clones)
        traj_frame = self.build_traj_frame(top_clones)

        fig, ax = plt.subplots(figsize=figsize)
        plt.rc('xtick', labelsize=30)
        plt.rc('ytick', labelsize=30)
        ax.set_yscale('log')
        ax.set_xlabel('time', fontsize=25)
        ax.set_ylabel('frequency', fontsize=25)

        log_counts = np.log10(traj_frame['Clone cumul freq'].values)
        max_log_count = max(log_counts)
        min_log_count = min(log_counts)

        for id_, row in traj_frame.iterrows():
            traj = row.drop(['Clone cumul freq']).to_numpy()
            log_count = np.log10(row['Clone cumul freq'])
            # Position on the colormap, rescaled to [0, 1]
            norm_log_count = (log_count - min_log_count)/(max_log_count - min_log_count)
            ax.plot(self.times, traj, c=cmap(norm_log_count))

        sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=min_log_count, vmax=max_log_count))
        sm.set_array([])
        # The axes are passed explicitly because the mappable is not attached to any artist
        cb = fig.colorbar(sm, ax=ax)
        cb.set_label('Log10 cumulative frequency', fontsize=25)

        return fig, ax


    def PCA_traj(self, n_top_clones, nclus=4):

        """
        Perform PCA over the normalized trajectories of n_top_clones TCR clones.
        The normalization consists in dividing the whole trajectory by its maximum value.
        After PCA the trajectories are clustered in the space of the two principal components
        with a hierarchical clustering algorithm.
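
Illustration of the normalization and projection steps described above, as a NumPy-only sketch on made-up trajectories. It swaps scikit-learn's PCA for a plain SVD of the centered data and leaves the clustering step out, so component signs may differ from scikit-learn's output.

```python
import numpy as np

# Toy trajectories: rows are clones, columns are time points (made-up frequencies).
traj_matrix = np.array([
    [0.10, 0.20, 0.40],
    [0.30, 0.15, 0.05],
    [0.02, 0.04, 0.08],
    [0.50, 0.25, 0.10],
])

# Normalize each trajectory by its maximum, so every row peaks at 1.
norm_traj = traj_matrix / traj_matrix.max(axis=1)[:, np.newaxis]

# PCA through an SVD of the centered data (time points as samples, clones as features).
X = norm_traj.T
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                      # shape (2, n_clones)
explained_ratio = S**2 / np.sum(S**2)    # fraction of variance per component

clone_coords = components.T              # one 2-D point per clone
print(clone_coords.shape)
```

Each row of `clone_coords` is the point that a clustering algorithm would then group; dividing by the maximum removes the overall abundance so that only the shape of the trajectory matters.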

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes in each time point to consider in the PCA
        nclus : int
            number of clusters

        Returns
        -------
        pca : sklearn.decomposition._pca.PCA
            object containing the result of the principal component analysis

        clustering : sklearn.cluster._agglomerative.AgglomerativeClustering
            object containing the result of the hierarchical clustering
        """

        #Getting the top n_top_clones clonotypes at each time point
        top_clones = self.top_clones_set(n_top_clones)
        #Building a trajectory dataframe
        traj_frame = self.build_traj_frame(top_clones)

        #Converting it to a numpy matrix
        traj_matrix = traj_frame.drop(['Clone cumul freq'], axis=1).to_numpy()

        # Normalize each trajectory by its maximum
        norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis]

        # Time points are the samples and clones are the features, so that
        # pca.components_.T gives one 2-D point per clone
        pca = PCA(n_components=2).fit(norm_traj_matrix.T)
        clustering = AgglomerativeClustering(n_clusters=nclus)
        clustering = clustering.fit(pca.components_.T)

        return pca, clustering


    def plot_PCA2(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):

        """
        Plotting the trajectories in the space of their two principal components and
        clustering them as in "PCA_traj".

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes in each time point to consider in the PCA
        nclus : int
            number of clusters
        colormap : str
            colormap indicating the different clusters
        figsize : tuple
            width, height in inches
        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes._subplots.AxesSubplot
            axes where to draw the plot
        """

        cmap = plt.get_cmap(colormap)
        pca, clustering = self.PCA_traj(n_top_clones, nclus)

        fig, ax = plt.subplots(figsize=figsize)
        # The number of trajectories is the number of features seen by the PCA
        ax.set_title('PCA components (%i trajs)' % pca.components_.shape[1], fontsize=25)
        ax.set_xlabel('First component (expl var: %3.2f)' % pca.explained_variance_ratio_[0], fontsize=25)
        ax.set_ylabel('Second component (expl var: %3.2f)' % pca.explained_variance_ratio_[1], fontsize=25)
        for c_ind in range(clustering.n_clusters):
            x = pca.components_[0][clustering.labels_ == c_ind]
            y = pca.components_[1][clustering.labels_ == c_ind]
            ax.scatter(x, y, alpha=0.2, color=cmap(c_ind/clustering.n_clusters))

        return fig, ax


    def plot_PCA_clusters_traj(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):

        """
        Plotting the trajectories grouped by PCA clusters

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes in each time point to consider in the PCA
        nclus : int
            number of clusters
        colormap : str
            colormap indicating the different clusters
        figsize : tuple
            width, height in inches
        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        axs : array of matplotlib.axes._subplots.AxesSubplot
            axes where to draw the plots
        """

        cmap = plt.get_cmap(colormap)
        pca, clustering = self.PCA_traj(n_top_clones, nclus)

        n_cl = clustering.n_clusters

        #Getting the top n_top_clones clonotypes at each time point
        top_clones = self.top_clones_set(n_top_clones)
        #Building a trajectory dataframe
        traj_frame = self.build_traj_frame(top_clones)

        #Converting it to a numpy matrix
        traj_matrix = traj_frame.drop(['Clone cumul freq'], axis=1).to_numpy()

        # Normalize each trajectory by its maximum
        norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis]

        # Note: the figure size is derived from the number of clusters; the figsize argument is not used here
        fig, axs = plt.subplots(2, n_cl, figsize=(5*n_cl, 12))
        for cl in range(n_cl):
            trajs = norm_traj_matrix[clustering.labels_ == cl]
            axs[0][cl].set_xlabel('Time', fontsize=15)
            axs[0][cl].set_ylabel('Normalized frequency', fontsize=15)
            axs[1][cl].set_xlabel('Time', fontsize=15)
            axs[1][cl].set_ylabel('Normalized frequency', fontsize=15)
            # Top row: individual trajectories of the cluster
            for traj in trajs:
                axs[0][cl].plot(self.times, traj, alpha=0.2, color=cmap(cl/n_cl))
            # Bottom row: mean trajectory with standard deviation as error bars
            axs[1][cl].set_ylim(0, 1)
            axs[1][cl].errorbar(self.times, np.mean(trajs, axis=0),
                                yerr=np.std(trajs, axis=0), lw=3, color=cmap(cl/n_cl))
            #axs[1][cl].fill_between(times, np.quantile(trajs, 0.75, axis=0), np.quantile(trajs, 0.25, axis=0), color=colors[cl])

        plt.tight_layout()
        return fig, axs

#===============================Data-Pre-Processing===================================

class Data_Process():

    """
    A class used to represent longitudinal RepSeq data and pre-analysis of the longitudinal data associated with
    one individual.
    ...
    Attributes
    ----------
    path : str
        the path to the data files used for the analysis
    filename1 : str
        the name of the file of the RepSeq sample, which can be the first replicate when learning the experimental noise
        or the first time point RepSeq sample when analysing responding clones to a stimulus between two time points.
    filename2 : str
        the name of the file of the RepSeq sample which can be the second replicate when deciphering the experimental noise
        or the second time point RepSeq sample when analysing responding clones to a stimulus between two time points.
    colnames1 : list of str
        list of column names of the data set - first sample
    colnames2 : list of str
        list of column names of the data set - second sample
    Methods
    -------
    import_data() :
        imports and merges two RepSeq samples and builds a single data-frame with frequencies and abundances of all TCR clones present in the
        union of both samples.

    """

    def __init__(self, path, filename1, filename2, colnames1, colnames2):

        self.path = path
        self.filename1 = filename1
        self.filename2 = filename2
        self.colnames1 = colnames1
        self.colnames2 = colnames2


    def import_data(self):
        """
        Imports and merges two RepSeq samples and builds a single data-frame with frequencies and abundances of all TCR clones present in the union of both samples.

        Parameters
        ----------
        NONE
        Returns
        -------
        number_clones
            int, number of clones in the data frame which is the union of the two RepSeq samples used as inputs of the function
        df
            pandas data-frame containing the information labeled in the colnames lists
            for both RepSeq samples taken as input.
        """

        mincount = 0
        maxcount = np.inf

        headerline=0 #line number of headerline
        newnames=['Clone_fraction','Clone_count','ntCDR3','AACDR3']

        if self.filename1[-2:] == 'gz':
            F1Frame_chunk=pd.read_csv(self.path + self.filename1, delimiter='\t',usecols=self.colnames1,header=headerline, compression = 'gzip')[self.colnames1]
        else:
            F1Frame_chunk=pd.read_csv(self.path + self.filename1, delimiter='\t',usecols=self.colnames1,header=headerline)[self.colnames1]

        if self.filename2[-2:] == 'gz':
            F2Frame_chunk=pd.read_csv(self.path + self.filename2, delimiter='\t',usecols=self.colnames2,header=headerline, compression = 'gzip')[self.colnames2]

        else:
            F2Frame_chunk=pd.read_csv(self.path + self.filename2, delimiter='\t',usecols=self.colnames2,header=headerline)[self.colnames2]

        F1Frame_chunk.columns=newnames
        F2Frame_chunk.columns=newnames
        suffixes=('_1','_2')
        mergedFrame=pd.merge(F1Frame_chunk,F2Frame_chunk,on=newnames[2],suffixes=suffixes,how='outer')
        for nameit in [0,1]:
            for labelit in suffixes:
                colname=newnames[nameit]+labelit
                mergedFrame[colname]=mergedFrame[colname].fillna(0) #clones absent from one sample get a zero fraction/count
                if nameit==1:
                    mergedFrame[colname]=mergedFrame[colname].astype(int) #assign back, astype does not work in place
        def dummy(x):
            val=x.iloc[0]
            if pd.isnull(val):
                val=x.iloc[1]
            return val
        mergedFrame.loc[:,newnames[3]+suffixes[0]]=mergedFrame.loc[:,[newnames[3]+suffixes[0],newnames[3]+suffixes[1]]].apply(dummy,axis=1) #assigns AA sequence to clones, creates duplicates
        mergedFrame.drop(columns=newnames[3]+suffixes[1],inplace=True) #removes duplicates
        mergedFrame.rename(columns = {newnames[3]+suffixes[0]:newnames[3]}, inplace = True)
        mergedFrame=mergedFrame[[newname+suffix for newname in newnames[:2] for suffix in suffixes]+[newnames[2],newnames[3]]]
        filterout=((mergedFrame.Clone_count_1 < mincount) & (mergedFrame.Clone_count_2==0)) | ((mergedFrame.Clone_count_2 < mincount) &
                   (mergedFrame.Clone_count_1==0)) #has effect only if mincount>0
        number_clones=len(mergedFrame)
        return number_clones,mergedFrame.loc[((mergedFrame.Clone_count_1 <= maxcount) & (mergedFrame.Clone_count_2 <= maxcount)) & ~filterout]




#===============================Noise-Model===================================================================
### Noise Model

class Noise_Model():

    """
    A class used to build an object associated to methods in order to learn the experimental noise from same day
    biological RepSeq samples.
    ...
    Methods
    -------
    get_sparserep(df) :
        get sparse representation of the abundances / frequencies of the TCR clones present in both RepSeq samples of interest.
        this changes the data input to speed up the algorithm
    learn_null_model(df, noise_model, init_paras, output_dir = None, filename = None, display_loss_function = False) :
        function to optimize the likelihood associated to the experimental noise model and get the associated parameters.
    diversity_estimate(df, paras, noise_model) :
        function to get the estimation of diversity from the noise model information.
    """


    def get_sparserep(self, df):
        """
        Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
        unicountvals_1(2) are the unique values of n1(2).
        sparse_rep_counts gives the counts of unique pairs.
        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
        len(indn1)=len(indn2)=len(sparse_rep_counts)
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            this data-frame should give the list of TCR clones present in two replicate RepSeq samples
            associated to their clone frequencies and clone abundances in the first and second replicate.
        Returns
        -------
        indn1
            numpy array list of indexes of all values of unicountvals_1
        indn2
            numpy array list of indexes of all values of unicountvals_2
        sparse_rep_counts
            numpy array, # of clones having the read counts pair {(n1,n2)}
        unicountvals_1
            numpy array list of unique counts values present in the first sample in df[Clone_count_1]
        unicountvals_2
            numpy array list of unique counts values present in the second sample in df[Clone_count_2]
        NreadsI
            float, total number of counts/reads in the first sample referred in df by "_1"
        NreadsII
            float, total number of counts/reads in the second sample referred in df by "_2"
        """

        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']]
        counts['paircount'] = 1 # gives a weight of 1 to each observed clone

        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
        clonecountpair_vals = clone_counts.index.values
        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
        NreadsI = np.sum(counts['Clone_count_1'])
        NreadsII = np.sum(counts['Clone_count_2'])

        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)

        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII



    def _NegBinPar(self,m,v,mvec):
        '''
        Same as NegBinParMtr, but for m and v being scalars.
        Assumes m>0.
        Output is (len(mvec),) array
        '''
        mmax=mvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(mmax+1,dtype=float)
        NBvec[1:]=np.log((NBvec[1:]+r-1)/NBvec[1:]*p) #vectorization won't help unfortunately here since log needs to be over array
        NBvec[0]=r*math.log(m/v)
        NBvec=np.exp(np.cumsum(NBvec)[mvec]) #save a bit here
        return NBvec

    def _NegBinParMtr(self,m,v,nvec): #speed up only insofar as the log and exp are called once on array instead of multiple times on rows
        '''
        computes NegBin probabilities over the ordered (but possibly discontiguous) vector (nvec)
        for mean/variance combinations given by the mean (m) and variance (v) vectors.
        Note that m < v for negative binomial.
        Output is (len(m),len(nvec)) array
        '''
        nmax=nvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(nmax+1,dtype=float)
        NBvec=np.log((NBvec+r[:,np.newaxis]-1)*(p[:,np.newaxis]/NBvec))
        NBvec[:,0]=r*np.log(m/v) #handle NBvec[0]=0, treated specially when m[0]=0, see below
        NBvec=np.exp(np.cumsum(NBvec,axis=1)) #save a bit here
        if m[0]==0:
            NBvec[0,:]=0.
            NBvec[0,0]=1.
        NBvec=NBvec[:,nvec]
        return NBvec

    def _PoisPar(self, Mvec,unicountvals):
        #assert Mvec[0]==0, "first element needs to be zero"
        nmax=unicountvals[-1]
        nlen=len(unicountvals)
        mlen=len(Mvec)
        Nvec=unicountvals
        logNvec=-np.insert(np.cumsum(np.log(np.arange(1,nmax+1))),0,0.)[unicountvals] #avoid n=0 nans
        Nmtr=np.exp(Nvec[np.newaxis,:]*np.log(Mvec)[:,np.newaxis]+logNvec[np.newaxis,:]-Mvec[:,np.newaxis]) # np.log(Mvec) throws warning: since log(0)=-inf
        if Mvec[0]==0:
            Nmtr[0,:]=np.zeros((nlen,)) #when m=0, n=0, and so get rid of nans from log(0)
            Nmtr[0,0]=1.
            #handled below
        if unicountvals[0]==0: #if n=0 included get rid of nans from log(0)
            Nmtr[:,0]=np.exp(-Mvec)
        return Nmtr

    def _get_rhof(self,alpha_rho, nfbins,fmin,freq_dtype):
        '''
        generates power law (power is alpha_rho) clone frequency distribution over
        nfbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax), nfbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst
        return logrhovec,logfvec, normconst


    def _get_logPn_f(self,unicounts,Nreads,logfvec, noise_model, paras):

        """
        tools to compute the likelihood of the noise model. It is not useful for the user.
        """

        # Choice of the model:

        if noise_model < 1:

            m_total=float(np.power(10, paras[3]))
            r_c=Nreads/m_total
        if noise_model < 2:

            beta_mv= paras[1]
            alpha_mv=paras[2]

        if noise_model < 1: #for models that include cell counts
            #compute parametrized range (mean-sigma,mean+5*sigma) of m values (number of cells) conditioned on n values (reads) appearing in the data only
            nsigma=5.
            nmin=300.
            #for each n, get actual range of m to compute around n-dependent mean m
            m_low =np.zeros((len(unicounts),),dtype=int)
            m_high=np.zeros((len(unicounts),),dtype=int)
            for nit,n in enumerate(unicounts):
                mean_m=n/r_c
                dev=nsigma*np.sqrt(mean_m)
                m_low[nit] =int(mean_m- dev) if (mean_m>dev**2) else 0
                m_high[nit]=int(mean_m+5*dev) if ( n>nmin) else int(10*nmin/r_c)
            m_cellmax=np.max(m_high)
            #across n, collect all in-range m
            mvec_bool=np.zeros((m_cellmax+1,),dtype=bool) #cheap bool
            nvec=range(len(unicounts))
            for nit in nvec:
                mvec_bool[m_low[nit]:m_high[nit]+1]=True #mask vector
            mvec=np.arange(m_cellmax+1)[mvec_bool]
            #transform to in-range index
            for nit in nvec:
                m_low[nit]=np.where(m_low[nit]==mvec)[0][0]
                m_high[nit]=np.where(m_high[nit]==mvec)[0][0]

        Pn_f=np.zeros((len(logfvec),len(unicounts)))
        if noise_model==0:

            mean_m=m_total*np.exp(logfvec)
            var_m=mean_m+beta_mv*np.power(mean_m,alpha_mv)
            Poisvec = self._PoisPar(mvec*r_c,unicounts)
            for f_it in range(len(logfvec)):
                NBvec=self._NegBinPar(mean_m[f_it],var_m[f_it],mvec)
                for n_it,n in enumerate(unicounts):
                    Pn_f[f_it,n_it]=np.dot(NBvec[m_low[n_it]:m_high[n_it]+1],Poisvec[m_low[n_it]:m_high[n_it]+1,n_it])

        elif noise_model==1:

            mean_n=Nreads*np.exp(logfvec)
            var_n=mean_n+beta_mv*np.power(mean_n,alpha_mv)
            Pn_f = self._NegBinParMtr(mean_n,var_n,unicounts)
        elif noise_model==2:

            mean_n=Nreads*np.exp(logfvec)
            Pn_f= self._PoisPar(mean_n,unicounts)
        else:
            print('acq_model is 0,1, or 2 only')

        return np.log(Pn_f)

    #-----------------------------Null-Model-optimization--------------------------

    def _get_Pn1n2(self, paras, sparse_rep, noise_model):

        """
        Tool to compute likelihood of the noise model. It is not useful for the user.
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = sparse_rep

        nfbins = 1200
        freq_dtype = float

        # Parameters

        alpha = paras[0]
        fmin = np.power(10,paras[-1])

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)

        logfvec_tmp=deepcopy(logfvec)

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)

        # for the trapezoid integral methods

        dlogfby2=np.diff(logfvec)/2

        # Compute P(0,0) for the normalization constraint
        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])

        #print("computing P(n1,n2)")
        Pn1n2 = np.zeros(len(sparse_rep_counts)) # 1D representation
        for it, (ind1, ind2) in enumerate(zip(indn1, indn2)):
            integ = np.exp(logPn1_f[:, ind1] + logrhofvec + logPn2_f[:, ind2] + logfvec)
            Pn1n2[it] = np.dot(dlogfby2, integ[1:] + integ[:-1])
        Pn1n2 /= 1. - Pn0n0 # renormalize
        return -np.dot(sparse_rep_counts, np.where(Pn1n2 > 0, np.log(Pn1n2), 0)) / float(np.sum(sparse_rep_counts))




    def _callback(self, paras, nparas, sparse_rep, noise_model):
        '''prints iteration info. called by scipy.minimize.
        Not useful for the user.'''

        global curr_iter
        #curr_iter = 0
        global Loss_function
        print(''.join(['{0:d} ']+['{'+str(it)+':3.6f} ' for it in range(1,len(paras)+1)]).format(*([curr_iter]+list(paras))))
        #print ('{' + str(len(paras)+1) + ':3.6f}'.format( [self.get_Pn1n2(paras, sparse_rep, acq_model_type)]))
        Loss_function = self._get_Pn1n2(paras, sparse_rep, noise_model)
        print(Loss_function)
        curr_iter += 1



    # Constraints for the Null-Model, no filtered
    def _nullmodel_constr_fn(self, paras, sparse_rep, noise_model, constr_type):

        '''
        returns either or both of the two level-set functions: log<f>-log(1/N), with N=Nclones/(1-P(0,0)) and log(Z_f), with Z_f=N<f>_{n+n'=0} + sum_i^Nclones <f>_{f|n,n'}
        not useful for the user
        '''

        # Choice of the model:

        indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep

        #Variables that would be chosen in the future by the user
        nfbins = 1200
        freq_dtype = float

        alpha = paras[0] # power law exponent
        fmin = np.power(10, paras[-1]) # true minimal frequency

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)
        dlogfby2 = np.diff(logfvec) / 2.
        # 1/2 comes from trapezoid integration below

        integ = np.exp(logrhofvec + 2 * logfvec)
        avgf_ps = np.dot(dlogfby2, integ[:-1] + integ[1:])

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI, logfvec, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII, logfvec, noise_model, paras)

        integ = np.exp(logPn1_f[:, 0] + logPn2_f[:, 0] + logrhofvec + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])
        logPnng0 = np.log(1 - Pn0n0)
        avgf_null_pair = np.exp(logPnng0 - np.log(np.sum(sparse_rep_counts)))

        C1 = np.log(avgf_ps) - np.log(avgf_null_pair)

        integ = np.exp(logPn1_f[:, 0] + logPn2_f[:, 0] + logrhofvec + 2 * logfvec)
        log_avgf_n0n0 = np.log(np.dot(dlogfby2, integ[1:] + integ[:-1]))

        integ = np.exp(logPn1_f[:, indn1] + logPn2_f[:, indn2] + logrhofvec[:, np.newaxis] + logfvec[:, np.newaxis])
        log_Pn1n2 = np.log(np.sum(dlogfby2[:, np.newaxis] * (integ[1:, :] + integ[:-1, :]), axis=0))
        integ = np.exp(np.log(integ) + logfvec[:, np.newaxis])
        tmp = deepcopy(log_Pn1n2)
        tmp[tmp == -np.inf] = np.inf # since subtracted in next line
        avgf_n1n2 = np.exp(np.log(np.sum(dlogfby2[:, np.newaxis] * (integ[1:, :] + integ[:-1, :]), axis=0)) - tmp)
        log_sumavgf = np.log(np.dot(sparse_rep_counts, avgf_n1n2))

        logNclones = np.log(np.sum(sparse_rep_counts)) - logPnng0
        Z = np.exp(logNclones + np.log(Pn0n0) + log_avgf_n0n0) + np.exp(log_sumavgf)

        C2 = np.log(Z)


        # print('C1:'+str(C1)+' C2:'+str(C2))
        if constr_type == 0:
            return C1
        elif constr_type == 1:
            return C2
        else:
            return C1, C2



    # Null-Model optimization learning

    def learn_null_model(self, df, noise_model, init_paras, output_dir = None, filename = None, display_loss_function = False): # constraint type 1 gives only low error modes, see paper for details.
        """
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            this data-frame should give the list of TCR clones present in two replicate RepSeq samples
            associated to their clone frequencies and clone abundances in the first and second replicate.
        noise_model : int
            choice of noise model (0, 1 or 2)
        init_paras : numpy array
            initial vector of parameters to start the optimization of the model from data (df)
        output_dir : str
            default value is None, it is the name of the output directory in which we want to save the values of the parameters
        filename : str
            default value is None, it is the name (without extension) of the file in which the parameters are saved
        display_loss_function : bool
            boolean variable to choose if we want to print the loss function during the experimental noise learning, default value is
            False.

        Returns
        -------
        outstruct
            result object returned by scipy minimize, outstruct.x holds the parameters of the noise model
        constr_value
            float, value of the constraint

        """

        # Data introduction
        sparse_rep = self.get_sparserep(df)
        constr_type = 1

        # Choice of the model:
        # Parameters initialization depending on the model
        if noise_model < 1:
            parameter_labels = ['alph_rho', 'beta', 'alpha', 'm_total', 'fmin']
        elif noise_model == 1:
            parameter_labels = ['alph_rho', 'beta', 'alpha', 'fmin']
        else:
            parameter_labels = ['alph_rho', 'fmin']

        assert len(parameter_labels) == len(init_paras), "number of model and initial paras differ!"

        condict = {'type': 'eq', 'fun': self._nullmodel_constr_fn, 'args': (sparse_rep, noise_model, constr_type)}


        partialobjfunc = partial(self._get_Pn1n2, sparse_rep=sparse_rep, noise_model=noise_model)
        nullfunctol = 1e-6
        nullmaxiter = 200
        header = ['Iter'] + parameter_labels
        print(''.join(['{' + str(it) + ':9s} ' for it in range(len(init_paras) + 1)]).format(*header))

        global curr_iter
        curr_iter = 1
        callbackp = partial(self._callback, nparas=len(init_paras), sparse_rep = sparse_rep, noise_model= noise_model)
        outstruct = minimize(partialobjfunc, init_paras, method='SLSQP', callback=callbackp, constraints=condict,
                        options={'ftol': nullfunctol, 'disp': True, 'maxiter': nullmaxiter})

        constr_value = self._nullmodel_constr_fn(outstruct.x, sparse_rep, noise_model, constr_type)

        # parameter_labels was chosen above according to noise_model
        d = {'label' : parameter_labels, 'value': outstruct.x}
        df = pd.DataFrame(data = d)

        # default file name, also used when only output_dir is given
        if filename is None:
            filename = 'nullpara' + str(noise_model)

        if output_dir is None:
            df.to_csv(filename + '.txt', sep = '\t')
        else:
            df.to_csv(output_dir + '/' + filename + '.txt', sep = '\t')

        return outstruct, constr_value

    def diversity_estimate(self, df, paras, noise_model):

        """
        Estimate diversity of the individual repertoire from the experimental noise learning step.
        Parameters
        ----------
        df : data-frame
            The data-frame which has been used to learn the noise model
        paras : numpy array
            vector containing the noise parameters
        noise_model : int
            choice of noise model
        Returns
        -------
        diversity_estimate
            int, diversity estimate from the noise model inference.

        """

        sparse_rep = self.get_sparserep(df)

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2, NreadsI, NreadsII = sparse_rep

        nfbins = 1200
        freq_dtype = float

        # Parameters

        alpha = paras[0]
        fmin = np.power(10,paras[-1])

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)

        logfvec_tmp=deepcopy(logfvec)

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)

        # for the trapezoid integral methods

        dlogfby2=np.diff(logfvec)/2

        # Compute P(0,0) for the normalization constraint
        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])

        #print(np.sum(sparse_rep_counts))
        N_obs = np.sum(sparse_rep_counts)

        return int(N_obs/(1-Pn0n0))


#============================================Differential expression =============================================================

class Expansion_Model():

    """
    A class used to build an object associated to methods in order to select significant expanding or
    contracting clones from RepSeq samples taken at two different time points.
    ...
    Methods
    -------
    get_sparserep(df) :
        get sparse representation of the abundances / frequencies of the TCR clones present in RepSeq samples of both time points.
        This changes the data input to speed up the algorithm
    expansion_table(outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):
        generate the table of clones that have been significantly detected to be responsive to an acute stimulus.
    """


    def get_sparserep(self, df):
        """
        Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
        unicountvals_1(2) are the unique values of n1(2).
        sparse_rep_counts gives the counts of unique pairs.
        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
        len(indn1)=len(indn2)=len(sparse_rep_counts)
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            this data-frame should give the list of TCR clones present in two RepSeq samples, taken at two
            different time points, associated to their clone frequencies and clone abundances at the first and second time point.
        Returns
        -------
        indn1
            numpy array list of indexes of all values of unicountvals_1
        indn2
            numpy array list of indexes of all values of unicountvals_2
        sparse_rep_counts
            numpy array, # of clones having the read counts pair {(n1,n2)}
        unicountvals_1
            numpy array list of unique counts values present in the first sample in df[Clone_count_1]
        unicountvals_2
            numpy array list of unique counts values present in the second sample in df[Clone_count_2]
        NreadsI
            float, total number of counts/reads in the first sample referred in df by "_1" for first time point
        NreadsII
            float, total number of counts/reads in the second sample referred in df by "_2" for second time point
        """

        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']]
        counts['paircount'] = 1 # gives a weight of 1 to each observed clone

        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
        clonecountpair_vals = clone_counts.index.values
        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
        NreadsI = np.sum(counts['Clone_count_1'])
        NreadsII = np.sum(counts['Clone_count_2'])

        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)

        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII



    def _NegBinPar(self,m,v,mvec):
        '''
        Same as NegBinParMtr, but for m and v being scalars.
        Assumes m>0.
        Output is (len(mvec),) array
        '''
        mmax=mvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(mmax+1,dtype=float)
        NBvec[1:]=np.log((NBvec[1:]+r-1)/NBvec[1:]*p) #vectorization won't help unfortunately here since log needs to be over array
        NBvec[0]=r*math.log(m/v)
        NBvec=np.exp(np.cumsum(NBvec)[mvec]) #save a bit here
        return NBvec


    def _NegBinParMtr(self,m,v,nvec): #speed up only insofar as the log and exp are called once on array instead of multiple times on rows
        '''
        computes NegBin probabilities over the ordered (but possibly discontiguous) vector (nvec)
        for mean/variance combinations given by the mean (m) and variance (v) vectors.
        Note that m < v for negative binomial.
        Output is (len(m),len(nvec)) array
        '''
        nmax=nvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(nmax+1,dtype=float)
        NBvec=np.log((NBvec+r[:,np.newaxis]-1)*(p[:,np.newaxis]/NBvec))
        NBvec[:,0]=r*np.log(m/v) #handle NBvec[0]=0, treated specially when m[0]=0, see below
        NBvec=np.exp(np.cumsum(NBvec,axis=1)) #save a bit here
        if m[0]==0:
            NBvec[0,:]=0.
            NBvec[0,0]=1.
        NBvec=NBvec[:,nvec]
        return NBvec

    def _PoisPar(self, Mvec,unicountvals):
        #assert Mvec[0]==0, "first element needs to be zero"
        nmax=unicountvals[-1]
        nlen=len(unicountvals)
        mlen=len(Mvec)
        Nvec=unicountvals
        logNvec=-np.insert(np.cumsum(np.log(np.arange(1,nmax+1))),0,0.)[unicountvals] #avoid n=0 nans
        Nmtr=np.exp(Nvec[np.newaxis,:]*np.log(Mvec)[:,np.newaxis]+logNvec[np.newaxis,:]-Mvec[:,np.newaxis]) # np.log(Mvec) throws warning: since log(0)=-inf
        if Mvec[0]==0:
            Nmtr[0,:]=np.zeros((nlen,)) #when m=0, n=0, and so get rid of nans from log(0)
            Nmtr[0,0]=1.
            #handled below
        if unicountvals[0]==0: #if n=0 included get rid of nans from log(0)
            Nmtr[:,0]=np.exp(-Mvec)
        return Nmtr

    def _get_rhof(self,alpha_rho, nfbins,fmin,freq_dtype):
        '''
        generates power law (power is alpha_rho) clone frequency distribution over
        nfbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax), nfbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst
        return logrhovec,logfvec


    def _get_logPn_f(self,unicounts,Nreads,logfvec, noise_model, paras):

        """
        tools to compute the likelihood of the noise model. It is not useful for the user.
        """

        # Choice of the model:

        if noise_model < 1:

            m_total=float(np.power(10, paras[3]))
            r_c=Nreads/m_total
        if noise_model < 2:

            beta_mv= paras[1]
            alpha_mv=paras[2]

        if noise_model < 1: #for models that include cell counts
            #compute parametrized range (mean-sigma,mean+5*sigma) of m values (number of cells) conditioned on n values (reads) appearing in the data only
            nsigma=5.
            nmin=300.
            #for each n, get actual range of m to compute around n-dependent mean m
            m_low =np.zeros((len(unicounts),),dtype=int)
            m_high=np.zeros((len(unicounts),),dtype=int)
            for nit,n in enumerate(unicounts):
                mean_m=n/r_c
                dev=nsigma*np.sqrt(mean_m)
                m_low[nit] =int(mean_m- dev) if (mean_m>dev**2) else 0
                m_high[nit]=int(mean_m+5*dev) if ( n>nmin) else int(10*nmin/r_c)
            m_cellmax=np.max(m_high)
            #across n, collect all in-range m
            mvec_bool=np.zeros((m_cellmax+1,),dtype=bool) #cheap bool
            nvec=range(len(unicounts))
            for nit in nvec:
                mvec_bool[m_low[nit]:m_high[nit]+1]=True #mask vector
            mvec=np.arange(m_cellmax+1)[mvec_bool]
            #transform to in-range index
            for nit in nvec:
                m_low[nit]=np.where(m_low[nit]==mvec)[0][0]
                m_high[nit]=np.where(m_high[nit]==mvec)[0][0]

        Pn_f=np.zeros((len(logfvec),len(unicounts)))
        if noise_model==0:

            mean_m=m_total*np.exp(logfvec)
            var_m=mean_m+beta_mv*np.power(mean_m,alpha_mv)
            Poisvec = self._PoisPar(mvec*r_c,unicounts)
            for f_it in range(len(logfvec)):
                NBvec=self._NegBinPar(mean_m[f_it],var_m[f_it],mvec)
                for n_it,n in enumerate(unicounts):
                    Pn_f[f_it,n_it]=np.dot(NBvec[m_low[n_it]:m_high[n_it]+1],Poisvec[m_low[n_it]:m_high[n_it]+1,n_it])

        elif noise_model==1:

            mean_n=Nreads*np.exp(logfvec)
            var_n=mean_n+beta_mv*np.power(mean_n,alpha_mv)
            Pn_f = self._NegBinParMtr(mean_n,var_n,unicounts)
        elif noise_model==2:

            mean_n=Nreads*np.exp(logfvec)
            Pn_f= self._PoisPar(mean_n,unicounts)
        else:
            print('acq_model is 0,1,or 2 only')

        return np.log(Pn_f)

    def _get_Ps(self, alp,sbar,smax,stp):
        '''
        generates symmetric exponential distribution over log fold change
        with effect size sbar and nonresponding fraction 1-alp at s=0.
        computed over discrete range of s from -smax to smax in steps of size stp
        '''
        lamb=-stp/sbar
        smaxt=round(smax/stp)
        s_zeroind=int(smaxt)
        Z=2*(np.exp((smaxt+1)*lamb)-1)/(np.exp(lamb)-1)-1
        Ps=alp*np.exp(lamb*np.fabs(np.arange(-smaxt,smaxt+1)))/Z
        Ps[s_zeroind]+=(1-alp)
        return Ps

    def _callbackFdiffexpr(self, Xi): #case dependent
        '''prints iteration info. called by scipy.minimize'''

        print('{0: 3.6f} {1: 3.6f} '.format(Xi[0], Xi[1])+'\n')


    def _learning_dynamics_expansion_polished(self, df, paras_1, paras_2, noise_model):
        """
        function to infer the expansion mode parameters - not usable by the user.
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = self.get_sparserep(df)

        alpha_rho = paras_1[0]
        fmin = np.power(10,paras_1[-1])
        freq_dtype = 'float64'
        nfbins = 1200 #Accuracy of the integration


        logrhofvec, logfvec = self._get_rhof(alpha_rho, nfbins, fmin, freq_dtype) #method of this class, must be called through self

        #Definition of svec
        smax = 25.0 #maximum absolute logfold change value
        s_step = 0.1
        s_0 = -1

        s_step_old= s_step
        logf_step= logfvec[1] - logfvec[0] #use natural log here since f2 increments in increments in exp().
        f2s_step= int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        s_step= float(f2s_step)*logf_step
        smax= s_step*(smax/s_step_old)
        svec= s_step*np.arange(0,int(round(smax/s_step)+1))
        svec= np.append(-svec[1:][::-1],svec)

        smaxind=(len(svec)-1)/2
        f2s_step=int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        logfmin=logfvec[0 ]-f2s_step*smaxind*logf_step
        logfmax=logfvec[-1]+f2s_step*smaxind*logf_step

        logfvecwide = np.linspace(logfmin,logfmax,int(len(logfvec)+2*smaxind*f2s_step)) #a wider domain for the second frequency f2=f1*exp(s); the number of points must be an int

        # Compute P(n1|f) and P(n2|f), each in an iteration of the following loop

        for it in range(2):
            if it == 0:
                unicounts=unicountvals_1
                logfvec_tmp=deepcopy(logfvec)
                Nreads = NreadsI
                paras = paras_1
            else:
                unicounts=unicountvals_2
                logfvec_tmp=deepcopy(logfvecwide) #contains s-shift for sampled data method
                Nreads = NreadsII
                paras = paras_2
            if it == 0:
                logPn1_f = self._get_logPn_f( unicounts, Nreads, logfvec_tmp, noise_model, paras)

            else:
                logPn2_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

        #for the trapezoid method
        dlogfby2=np.diff(logfvec)/2

        # Computing P(n1,n2|f,s)
        Pn1n2_s=np.zeros((len(svec), len(unicountvals_1), len(unicountvals_2)))

        for s_it,s in enumerate(svec):
            for it,(n1_it, n2_it) in enumerate(zip(indn1,indn2)):
                integ = np.exp(logrhofvec+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),n2_it]+logPn1_f[:,n1_it]+ logfvec )
                Pn1n2_s[s_it, n1_it, n2_it] = np.dot(dlogfby2,integ[1:] + integ[:-1])


        Pn0n0_s = np.zeros(svec.shape)
        for s_it,s in enumerate(svec):
            integ=np.exp(logPn1_f[:,0]+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),0]+logrhofvec+logfvec)
            Pn0n0_s[s_it]=np.dot(dlogfby2,integ[1:]+integ[:-1])


        N_obs = \
np.sum(sparse_rep_counts) 1972 print("N_obs: " + str(N_obs)) 1973 1974 1975 def cost(PARAS): 1976 1977 alp = PARAS[0] 1978 sbar = PARAS[1] 1979 1980 Ps = _get_Ps(self,alp,sbar,smax,s_step) 1981 Pn0n0=np.dot(Pn0n0_s,Ps) 1982 Pn1n2_ps=np.sum(Pn1n2_s*Ps[:,np.newaxis,np.newaxis],0) 1983 Pn1n2_ps/=1-Pn0n0 1984 print(Pn0n0) 1985 1986 1987 1988 Energy = - np.dot(sparse_rep_counts/float(N_obs),np.where(Pn1n2_ps[indn1,indn2]>0,np.log(Pn1n2_ps[indn1,indn2]),0)) 1989 1990 return Energy 1991 1992 #--------------------------Compute-the-grid----------------------------------------- 1993 1994 print('Calculation Surface : \n') 1995 st = time.time() 1996 1997 npoints = 20 #to be chosen by the user 1998 alpvec = np.logspace(-3,np.log10(0.99), npoints) 1999 sbarvec = np.linspace(0.01,5, npoints) 2000 2001 LSurface =np.zeros((len(sbarvec),len(alpvec))) 2002 for i in range(len(sbarvec)): 2003 for j in range(len(alpvec)): 2004 LSurface[i, j]= - cost([alpvec[j], sbarvec[i]]) 2005 2006 alpmesh, sbarmesh = np.meshgrid(alpvec, sbarvec) 2007 a,b = np.where(LSurface == np.max(LSurface)) 2008 print("--- %s seconds ---" % (time.time() - st)) 2009 2010 2011 #------------------------------Optimization---------------------------------------------- 2012 2013 optA = alpmesh[a[0],b[0]] 2014 optB = sbarmesh[a[0],b[0]] 2015 2016 print('polish parameter estimate from '+ str(optA)+' '+str(optB)) 2017 initparas=(optA,optB) 2018 2019 2020 outstruct = minimize(cost, initparas, method='SLSQP', callback=_callbackFdiffexpr, tol=1e-6,options={'ftol':1e-8 ,'disp': True,'maxiter':300}) 2021 2022 return outstruct.x, Pn1n2_s, Pn0n0_s, svec 2023 2024 def _learning_dynamics_expansion(self, sparse_rep, paras_1, paras_2, noise_model, display_plot=False): 2025 """ 2026 function to infer the expansion mode parameters - not usable by the user. 
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = sparse_rep

        alpha_rho = paras_1[0]
        fmin = np.power(10,paras_1[-1])
        freq_dtype = 'float64'
        nfbins = 1200 #Accuracy of the integration

        logrhofvec, logfvec = self.get_rhof(alpha_rho, nfbins, fmin, freq_dtype)

        #Definition of svec
        smax = 25.0 #maximum absolute logfold change value
        s_step = 0.1
        s_0 = -1

        s_step_old= s_step
        logf_step= logfvec[1] - logfvec[0] #natural log is used here since f2=f1*exp(s)
        f2s_step= int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        s_step= float(f2s_step)*logf_step
        smax= s_step*(smax/s_step_old)
        svec= s_step*np.arange(0,int(round(smax/s_step)+1))
        svec= np.append(-svec[1:][::-1],svec)

        smaxind=(len(svec)-1)/2
        f2s_step=int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        logfmin=logfvec[0 ]-f2s_step*smaxind*logf_step
        logfmax=logfvec[-1]+f2s_step*smaxind*logf_step

        logfvecwide = np.linspace(logfmin,logfmax,int(len(logfvec)+2*smaxind*f2s_step)) #a wider domain for the second frequency f2=f1*exp(s)

        # Compute P(n1|f) and P(n2|f), each in an iteration of the following loop

        for it in range(2):
            if it == 0:
                unicounts=unicountvals_1
                logfvec_tmp=deepcopy(logfvec)
                Nreads = NreadsI
                paras = paras_1
            else:
                unicounts=unicountvals_2
                logfvec_tmp=deepcopy(logfvecwide) #contains s-shift for sampled data method
                Nreads = NreadsII
                paras = paras_2
            if it == 0:
                logPn1_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

            else:
                logPn2_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

        #for the trapezoid method
        dlogfby2=np.diff(logfvec)/2

        # Computing P(n1,n2|f,s)
        Pn1n2_s=np.zeros((len(svec), len(unicountvals_1),
                          len(unicountvals_2)))

        for s_it,s in enumerate(svec):
            for it,(n1_it, n2_it) in enumerate(zip(indn1,indn2)):
                integ = np.exp(logrhofvec+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),n2_it]+logPn1_f[:,n1_it]+ logfvec )
                Pn1n2_s[s_it, n1_it, n2_it] = np.dot(dlogfby2,integ[1:] + integ[:-1])

        Pn0n0_s = np.zeros(svec.shape)
        for s_it,s in enumerate(svec):
            integ=np.exp(logPn1_f[:,0]+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),0]+logrhofvec+logfvec)
            Pn0n0_s[s_it]=np.dot(dlogfby2,integ[1:]+integ[:-1])

        N_obs = np.sum(sparse_rep_counts)
        print("N_obs: " + str(N_obs))

        def cost(PARAS):

            alp = PARAS[0]
            sbar = PARAS[1]

            Ps = self._get_Ps(alp,sbar,smax,s_step)
            Pn0n0=np.dot(Pn0n0_s,Ps)
            Pn1n2_ps=np.sum(Pn1n2_s*Ps[:,np.newaxis,np.newaxis],0)
            Pn1n2_ps/=1-Pn0n0
            #print(Pn0n0)

            Energy = - np.dot(sparse_rep_counts/float(N_obs),np.where(Pn1n2_ps[indn1,indn2]>0,np.log(Pn1n2_ps[indn1,indn2]),0))

            return Energy

        #--------------------------Compute-the-grid-----------------------------------------

        print('Calculation Surface : \n')
        st = time.time()

        npoints = 50 #to be chosen by the user
        alpvec = np.logspace(-3,np.log10(0.99), npoints)
        sbarvec = np.linspace(0.01,5, npoints)

        LSurface =np.zeros((len(sbarvec),len(alpvec)))
        for i in range(len(sbarvec)):
            for j in range(len(alpvec)):
                LSurface[i, j]= - cost([alpvec[j], sbarvec[i]])

        alpmesh, sbarmesh = np.meshgrid(alpvec, sbarvec)
        a,b = np.where(LSurface == np.max(LSurface))
        print("--- %s seconds ---" % (time.time() - st))

        #---------------------------Plot-the-grid-------------------------------------------
        if display_plot:

            fig, ax =plt.subplots(1, figsize=(10,8))

            a,b = np.where(LSurface == np.max(LSurface))

            ax.contour(alpmesh, sbarmesh, LSurface,
                       linewidths=1, colors='k', linestyles = 'solid')
            plt.contourf(alpmesh, sbarmesh, LSurface, 20, cmap = 'viridis', alpha= 0.8)

            xmax = alpmesh[a[0],b[0]]
            ymax = sbarmesh[a[0],b[0]]
            text= r"$ \alpha={:.3f}, s={:.3f} $".format(xmax, ymax)
            bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
            arrowprops=dict(arrowstyle="->",connectionstyle="angle,angleA=0,angleB=80")
            kw = dict(xycoords='data',textcoords="axes fraction",
                      arrowprops=arrowprops, bbox=bbox_props, ha="right", va="top")
            plt.annotate(text, xy=(xmax, ymax), xytext=(0.94,0.96), **kw)
            plt.xlabel(r'$ \alpha, \ size \ of \ the \ repertoire \ that \ answers \ to \ the \ vaccine $')
            plt.ylabel(r'$ s_{bar}, \ characteristic \ expansion \ decrease $')
            plt.xscale('log')
            plt.yscale('log')
            plt.grid()
            plt.title(r'$Grid \ Search \ graph \ for \ \alpha \ and \ s_{bar} \ parameters. $')
            plt.colorbar()

        return LSurface, Pn1n2_s, Pn0n0_s, svec

    def _save_table(self, outpath, svec, Ps,Pn1n2_s, Pn0n0_s, subset, unicountvals_1_d, unicountvals_2_d, indn1_d, indn2_d, print_expanded, pthresh, smedthresh):
        '''
        takes learned diffexpr model, Pn1n2_s*Ps, computes posteriors over (n1,n2) pairs, and writes to file a table of data with clones as rows and measures of their posteriors as columns
        print_expanded=True orders the table by ascending 1-P(s>0), else descending
        pthresh is the threshold in 'p-value'-like (null hypo) probability, 1-P(s>0|n1_i,n2_i), where i is the row (i.e. the clone) n.b. lower null prob implies larger probability of expansion
        smedthresh is the threshold on the posterior median, below which clones are discarded
        not usable by the user.
        '''

        Psn1n2_ps=Pn1n2_s*Ps[:,np.newaxis,np.newaxis]

        #compute marginal likelihood (neglect renormalization , since it cancels in conditional below)
        Pn1n2_ps=np.sum(Psn1n2_ps,0)

        Ps_n1n2ps=Pn1n2_s*Ps[:,np.newaxis,np.newaxis]/Pn1n2_ps[np.newaxis,:,:]
        #compute cdf to get p-value to threshold on to reduce output size
        cdfPs_n1n2ps=np.cumsum(Ps_n1n2ps,0)

        def dummy(row,cdfPs_n1n2ps,unicountvals_1_d,unicountvals_2_d):
            '''
            when applied to dataframe, generates 'p-value'-like (null hypo) probability, 1-P(s>0|n1_i,n2_i), where i is the row (i.e. the clone)
            '''
            return cdfPs_n1n2ps[np.argmin(np.fabs(svec)),row['Clone_count_1']==unicountvals_1_d,row['Clone_count_2']==unicountvals_2_d][0]
        dummy_part=partial(dummy,cdfPs_n1n2ps=cdfPs_n1n2ps,unicountvals_1_d=unicountvals_1_d,unicountvals_2_d=unicountvals_2_d)

        cdflabel=r'$1-P(s>0)$'
        subset[cdflabel]=subset.apply(dummy_part, axis=1)
        subset=subset[subset[cdflabel]<pthresh].reset_index(drop=True)

        #go from clone count pair (n1,n2) to index in unicountvals_1_d and unicountvals_2_d
        data_pairs_ind_1=np.zeros((len(subset),),dtype=int)
        data_pairs_ind_2=np.zeros((len(subset),),dtype=int)
        for it in range(len(subset)):
            #np.where returns a tuple of index arrays; take the first matching index as a scalar
            data_pairs_ind_1[it]=np.where(int(subset.iloc[it].Clone_count_1)==unicountvals_1_d)[0][0]
            data_pairs_ind_2[it]=np.where(int(subset.iloc[it].Clone_count_2)==unicountvals_2_d)[0][0]
        #posteriors over data clones
        Ps_n1n2ps_datpairs=Ps_n1n2ps[:,data_pairs_ind_1,data_pairs_ind_2]

        #compute posterior metrics
        mean_est=np.zeros((len(subset),))
        max_est= np.zeros((len(subset),))
        slowvec= np.zeros((len(subset),))
        smedvec= np.zeros((len(subset),))
        shighvec=np.zeros((len(subset),))
        pval=0.025 #double-sided comparison statistical test
        pvalvec=[pval,0.5,1-pval] #bound criteria defining slow, smed, and shigh, respectively
        for it,column in enumerate(np.transpose(Ps_n1n2ps_datpairs)):
            mean_est[it]=np.sum(svec*column)
            max_est[it]=svec[np.argmax(column)]
            forwardcmf=np.cumsum(column)
            backwardcmf=np.cumsum(column[::-1])[::-1]
            inds=np.where((forwardcmf[:-1]<pvalvec[0]) & (forwardcmf[1:]>=pvalvec[0]))[0]
            slowvec[it]=np.mean(svec[inds+np.ones((len(inds),),dtype=int)]) #use mean in case there are two values
            inds=np.where((forwardcmf>=pvalvec[1]) & (backwardcmf>=pvalvec[1]))[0]
            smedvec[it]=np.mean(svec[inds])
            inds=np.where((forwardcmf[:-1]<pvalvec[2]) & (forwardcmf[1:]>=pvalvec[2]))[0]
            shighvec[it]=np.mean(svec[inds+np.ones((len(inds),),dtype=int)])

        colnames=(r'$\bar{s}$',r'$s_{max}$',r'$s_{3,high}$',r'$s_{2,med}$',r'$s_{1,low}$')
        for it,coldata in enumerate((mean_est,max_est,shighvec,smedvec,slowvec)):
            subset.insert(0,colnames[it],coldata)
        oldcolnames=( 'AACDR3', 'ntCDR3', 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1', 'Clone_fraction_2')
        newcolnames=('CDR3_AA', 'CDR3_nt', r'$n_1$', r'$n_2$', r'$f_1$', r'$f_2$')
        subset=subset.rename(columns=dict(zip(oldcolnames, newcolnames)))

        #select only clones whose posterior median pass the given threshold
        subset=subset[subset[r'$s_{2,med}$']>smedthresh]

        print("writing to: "+outpath)
        if print_expanded:
            subset=subset.sort_values(by=cdflabel,ascending=True)
            strout='expanded'
        else:
            subset=subset.sort_values(by=cdflabel,ascending=False)
            strout='contracted'
        subset.to_csv(outpath+'top_'+strout+'.csv',sep='\t',index=False)

    def expansion_table(self, outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):

        '''
        Generate the table of clones detected as significantly responsive to an acute stimulus.

        Parameters
        ----------
        outpath : str
            Path prefix of the output table; the file is written to outpath + 'top_expanded.csv'
        paras_1 : numpy array
            parameters of the noise model that has been learned at time_1
        paras_2 : numpy array
            parameters of the noise model that has been learned at time_2
        df : pandas dataframe
            pandas dataframe merging the two RepSeq data at time_1 and time_2
        noise_model : int
            choice of noise model, following the convention used in Generator.gen_synthetic_data_Null:
            0: negative Binomial + Poisson, 1: negative Binomial, 2: Poisson
        pval_threshold : float
            P-value threshold to detect and discriminate if a TCR clone has expanded
        smed_threshold : float
            median of the log-fold change threshold to detect if a TCR clone has expanded
        Returns
        -------
        data-frame - csv file
            the output is a tab-separated csv file of columns : $s_{1,low}$, $s_{2,med}$, $s_{3,high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and the 'p-value'-like column $1-P(s>0)$
        '''

        sparse_rep = self.get_sparserep(df)
        L_surface, Pn1n2_s_d, Pn0n0_s_d, svec = self._learning_dynamics_expansion(sparse_rep, paras_1, paras_2, noise_model)
        npoints= 50 # same as in learning_dynamics_expansion
        smax = 25.0
        s_step = 0.1
        alpvec = np.logspace(-3,np.log10(0.99), npoints)
        sbarvec = np.linspace(0.01,5, npoints)
        maxinds=np.unravel_index(np.argmax(L_surface),np.shape(L_surface))
        optsbar=sbarvec[maxinds[0]]
        optalp=alpvec[maxinds[1]]
        optPs= self._get_Ps(optalp,optsbar,smax,s_step)
        pval_expanded = True

        indn1,indn2,sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep

        self._save_table(outpath, svec, optPs, Pn1n2_s_d, Pn0n0_s_d, df, unicountvals_1, unicountvals_2, indn1, indn2, pval_expanded, pval_threshold, smed_threshold)


#============================================Generate Synthetic Data =============================================================

class Generator:

    """
    A class used to build an object that generates in-silico (synthetic) RepSeq samples, either as replicates taken on
    the same day, or as 2 samples, the first generated at an initial time and the second some time after (months, years),
    using the geometric Brownian motion model described in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    ...
    Methods
    -------
    gen_synthetic_data_Null(paras, noise_model, NreadsI,NreadsII,Nsamp):
        generate in-silico same day RepSeq replicates.
    generate_trajectories(tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):
        generate in-silico t_ime apart RepSeq samples.
    """

    def _get_rhof(self, alpha_rho, fmin, freq_nbins=800, freq_dtype='float64'):

        '''
        generates power law (power is alpha_rho) clone frequency distribution over
        freq_nbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax),freq_nbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst
        return logrhovec,logfvec

    def _get_distsample(self, pmf,Nsamp,dtype='uint32'):
        '''
        generates Nsamp index samples of dtype (e.g. uint16 handles up to 65535 indices) from discrete probability mass function pmf.
        Handles multi-dimensional domain. N.B. Output is sorted.
        '''
        #assert np.sum(pmf)==1, "cmf not normalized!"

        shape = np.shape(pmf)
        sortindex = np.argsort(pmf, axis=None)#uses flattened array
        pmf = pmf.flatten()
        pmf = pmf[sortindex]
        cmf = np.cumsum(pmf)
        choice = np.random.uniform(high = cmf[-1], size = int(float(Nsamp)))
        index = np.searchsorted(cmf, choice)
        index = sortindex[index]
        index = np.unravel_index(index, shape)
        index = np.transpose(np.vstack(index))
        sampled_inds = np.array(index[np.argsort(index[:,0])],dtype=dtype)
        return sampled_inds

    def gen_synthetic_data_Null(self, paras, noise_model, NreadsI,NreadsII,Nsamp):
        '''
        outputs an array of observed clone frequencies and corresponding dataframe of pair counts
        for a null model learned from a dataset pair with NreadsI and NreadsII number of reads, respectively.
        Crucial for RAM efficiency, sampling is conditioned on being observed in each of the three (n,0), (0,n'), and n,n'>0 conditions
        so that only Nsamp clones need to be sampled, rather than the N clones in the repertoire.
        Note that no explicit normalization is applied. It is assumed that the values in paras are consistent with N<f>=1
        (e.g. were obtained through the learning done in this package).
        '''

        alpha = paras[0] #power law exponent
        fmin=np.power(10,paras[-1])
        if noise_model<1:
            m_total=float(np.power(10, paras[3]))
            r_c1=NreadsI/m_total
            r_c2=NreadsII/m_total
            r_cvec=[r_c1,r_c2]
        if noise_model<2:
            beta_mv= paras[1]
            alpha_mv=paras[2]

        #the Generator class defines this helper as _get_rhof
        logrhofvec,logfvec = self._get_rhof(alpha,fmin)
        fvec=np.exp(logfvec)
        dlogf=np.diff(logfvec)/2.

        #generate measurement model distribution, Pn_f
        Pn_f=np.empty((len(logfvec),),dtype=object) #len(logfvec) samplers

        #get value at n=0 to use for conditioning on n>0 (and get full Pn_f here if noise_model=1,2)
        m_max=1e3 #conditioned on n=0, so no edge effects

        Nreadsvec=(NreadsI,NreadsII)
        for it in range(2):
            Pn_f=np.empty((len(fvec),),dtype=object)
            if noise_model==2:
                m1vec=Nreadsvec[it]*fvec
                for find,m1 in enumerate(m1vec):
                    Pn_f[find]=poisson(m1)
                logPn0_f=-m1vec
            elif noise_model==1:
                m1=Nreadsvec[it]*fvec
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                for find,(n,p) in enumerate(zip(n,p)):
                    Pn_f[find]=nbinom(n,1-p)
                Pn0_f=np.asarray([Pn_find.pmf(0) for Pn_find in Pn_f])
                logPn0_f=np.log(Pn0_f)

            elif noise_model==0:
                m1=m_total*fvec
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pn0_f=np.zeros((len(fvec),))
                for find in range(len(Pn0_f)):
                    nbtmp=nbinom(n[find],1-p[find]).pmf(np.arange(m_max+1))
                    ptmp=poisson(r_cvec[it]*np.arange(m_max+1)).pmf(0)
                    Pn0_f[find]=np.sum(np.exp(np.log(nbtmp)+np.log(ptmp)))
                logPn0_f=np.log(Pn0_f)
            else:
                print('acq_model is 0,1,or 2 only')

            if it==0:
                Pn1_f=Pn_f
                logPn10_f=logPn0_f
            else:
                Pn2_f=Pn_f
                logPn20_f=logPn0_f

        #3-quadrant q|f conditional distribution (qx0:n1>0,n2=0;q0x:n1=0,n2>0;qxx:n1,n2>0)
        logPqx0_f=np.log(1-np.exp(logPn10_f))+logPn20_f
        logPq0x_f=logPn10_f+np.log(1-np.exp(logPn20_f))
        logPqxx_f=np.log(1-np.exp(logPn10_f))+np.log(1-np.exp(logPn20_f))
        #3-quadrant q,f joint distribution
        logPfqx0=logPqx0_f+logrhofvec
        logPfq0x=logPq0x_f+logrhofvec
        logPfqxx=logPqxx_f+logrhofvec
        #3-quadrant q marginal distribution
        Pqx0=np.trapz(np.exp(logPfqx0+logfvec),x=logfvec)
        Pq0x=np.trapz(np.exp(logPfq0x+logfvec),x=logfvec)
        Pqxx=np.trapz(np.exp(logPfqxx+logfvec),x=logfvec)

        #3 quadrant conditional f|q distribution
        Pf_qx0=np.where(Pqx0>0,np.exp(logPfqx0-np.log(Pqx0)),0)
        Pf_q0x=np.where(Pq0x>0,np.exp(logPfq0x-np.log(Pq0x)),0)
        Pf_qxx=np.where(Pqxx>0,np.exp(logPfqxx-np.log(Pqxx)),0)

        #3-quadrant q marginal distribution
        newPqZ=Pqx0 + Pq0x + Pqxx
        Pqx0/=newPqZ
        Pq0x/=newPqZ
        Pqxx/=newPqZ

        Pfqx0=np.exp(logPfqx0)
        Pfq0x=np.exp(logPfq0x)
        Pfqxx=np.exp(logPfqxx)

        print('Model probs: '+str(Pqx0)+' '+str(Pq0x)+' '+str(Pqxx))

        #get samples
        num_samples=Nsamp
        q_samples=np.random.choice(range(3), num_samples, p=(Pqx0,Pq0x,Pqxx))
        vals,counts=np.unique(q_samples,return_counts=True)
        num_qx0=counts[0]
        num_q0x=counts[1]
        num_qxx=counts[2]
        print('q samples: '+str(sum(counts))+' '+str(num_qx0)+' '+str(num_q0x)+' '+str(num_qxx))
        print('q sampled probs: '+str(num_qx0/float(sum(counts)))+' '+str(num_q0x/float(sum(counts)))+' '+str(num_qxx/float(sum(counts))))

        #x0
        integ=np.exp(np.log(Pf_qx0)+logfvec)
        f_samples_inds= self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qx0).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        qx0_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        qx0_samples=np.zeros((num_qx0,))
        if noise_model<1:
            qx0_m_samples=np.zeros((num_qx0,))
            #conditioning on n>0 applies an m-dependent factor to Pm_f, which can't be incorporated into the ppf method used for noise_model 1 and 2.
            #We handle that here by using a custom finite range sampler, which has the drawback of having to define an upper limit.
            #This works so long as n_max/r_c<<m_max, so depends on highest counts in data (n_max). My data had max counts of 1e3-1e4.
            #Alternatively, could define a custom scipy RV class by defining its PMF, but has to be array-compatible which requires care.
            m_samp_max=int(1e5)
            mvec=np.arange(m_samp_max)

        for it,find in enumerate(find_vals):
            if noise_model==0:
                m1=m_total*fvec[find]
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pm1_f=nbinom(n,1-p)

                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
                qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn1_m1=poisson(r_c1*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
                    qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)

            elif noise_model>0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
                qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
            else:
                print('acq_model is 0,1, or 2 only')
        qx0_pair_samples=np.hstack((qx0_samples[:,np.newaxis],np.zeros((num_qx0,1))))

        #0x
        integ=np.exp(np.log(Pf_q0x)+logfvec)
        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_q0x).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        q0x_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        q0x_samples=np.zeros((num_q0x,))
        if noise_model<1:
            q0x_m_samples=np.zeros((num_q0x,))
        for it,find in enumerate(find_vals):
            if noise_model==0:
                m2=m_total*fvec[find]
                v2=m2+beta_mv*np.power(m2,alpha_mv)
                p=1-m2/v2
                n=m2*m2/v2/p
                Pm2_f=nbinom(n,1-p)

                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
                q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn2_m2=poisson(r_c2*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
                    q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)

            elif noise_model > 0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
                q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
            else:
                print('acq_model is 0,1,or 2 only')
        q0x_pair_samples=np.hstack((np.zeros((num_q0x,1)),q0x_samples[:,np.newaxis]))

        #qxx
        integ=np.exp(np.log(Pf_qxx)+logfvec)
        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qxx).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        qxx_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        qxx_n1_samples=np.zeros((num_qxx,))
        qxx_n2_samples=np.zeros((num_qxx,))
        if noise_model<1:
            qxx_m1_samples=np.zeros((num_qxx,))
            qxx_m2_samples=np.zeros((num_qxx,))
        for it,find in enumerate(find_vals):
            if noise_model==0:
                m1=m_total*fvec[find]
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pm1_f=nbinom(n,1-p)

                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                if np.sum(Pm1_f_adj)==0:
                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
                else:
                    Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn1_m1=poisson(r_c1*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
                    qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)

                m2=m_total*fvec[find]
                v2=m2+beta_mv*np.power(m2,alpha_mv)
                p=1-m2/v2
                n=m2*m2/v2/p
                Pm2_f=nbinom(n,1-p)

                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                if np.sum(Pm2_f_adj)==0: #check the second-sample distribution (was Pm1_f_adj, a copy-paste slip)
                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
                else:
                    Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn2_m2=poisson(r_c2*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
                    qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)

            elif noise_model>0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
                qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
                qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
            else:
                print('acq_model is 0,1, or 2 only')

        qxx_pair_samples=np.hstack((qxx_n1_samples[:,np.newaxis],qxx_n2_samples[:,np.newaxis]))

        pair_samples=np.vstack((q0x_pair_samples,qx0_pair_samples,qxx_pair_samples))
        f_samples=np.concatenate((q0x_f_samples,qx0_f_samples,qxx_f_samples))
        output_m_samples=False
        if noise_model<1 and output_m_samples:
            m1_samples=np.concatenate((q0x_m1_samples,qx0_m1_samples,qxx_m1_samples))
            m2_samples=np.concatenate((q0x_m2_samples,qx0_m2_samples,qxx_m2_samples))

        pair_samples_df=pd.DataFrame({'Clone_count_1':pair_samples[:,0],'Clone_count_2':pair_samples[:,1]})

        pair_samples_df['Clone_fraction_1'] = pair_samples_df['Clone_count_1']/np.sum(pair_samples_df['Clone_count_1'])
        pair_samples_df['Clone_fraction_2'] = pair_samples_df['Clone_count_2']/np.sum(pair_samples_df['Clone_count_2'])

        return f_samples,pair_samples_df

    def generate_trajectories(self, tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):

        """
        generate in-silico t_ime apart RepSeq samples.

        Parameters
        ----------
        paras_1 : numpy array
            parameters of the noise model that has been learnt at time_1
        paras_2 : numpy array
            parameters of the noise model that has been learnt at time_2
        method : str
            'negative_binomial' or 'poisson'
        tau : float
            first time-scale parameter of the dynamics
        theta : float
            second time-scale parameter of the dynamics
        t_ime : float
            number of years between the two synthetic samples (between time_1 and time_2)
        filename : str
            name of the file in which the dataframe is stored
        Returns
        -------
        data-frame - csv file
            the output is a csv file of columns : 'Clone_count_1' (at time_1) 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'
        """

        import warnings #np.warnings is no longer available in recent NumPy releases

        np.seterr(divide = 'ignore')
        warnings.filterwarnings('ignore')

        #NOTE: this assignment overrides the method argument, so the 'poisson' branch below is never reached
        method = 'negative_binomial'

        # Synthetic data generation

        print('execution starting...')

        st = time.time()

        #Values of the parameters
        A = -1/tau
        B = 1/theta
        N_0 = 40
        NreadsI = float(NreadsI)
        NreadsII = float(NreadsII)

        t = float(t_ime)

        if NreadsI == NreadsII:
            key_sym = '_sym_'

        else:
            key_sym = '_asym_'

        # Name of the directory

        dirName = 'output'
        os.makedirs(dirName, exist_ok=True)

        paras = paras_1 #Just put a and b of the negative binomial distribution [0.7, 1.1]
        alpha = -1 +2*A/B
        #print('alpha : ' + str(alpha))

        #1/ Generate log-population at initial time from steady-state distribution + GBM diffusion trajectories for 2 years
        x_i_LB, x_f_LB, Prop_Matrix_LB, p_ext_LB, results_extinction_LB, time_vec_LB, results_extinction_source_LB, x_source_LB = _generator_diffusion_LB(A, B, N_0, t)

        #x_i_LB, x_f_LB, Prop_Matrix, p_ext, results_extinction = generator_diffusion_LB(B, A, N_0, t)
        N_cells_day_0_LB, N_cells_day_1_LB = np.sum(np.exp(x_i_LB)), np.sum(np.exp(x_f_LB)) + np.sum(np.exp(x_source_LB)) #N_cells_final_LB
        print('NUMBER OF CELLS AT INITIAL TIME')
        print(N_cells_day_0_LB)

        print('NUMBER OF CELLS AT FINAL TIME')
        print(N_cells_day_1_LB)

        #print('SHAPE_X_I ' + str(np.shape(x_i_LB)))
        #print('SHAPE_X_F ' + str(np.shape(x_f_LB)))

        if method == 'negative_binomial':

            df_diffusion_LB = _experimental_sampling_diffusion_NegBin(NreadsI, NreadsII, paras, x_i_LB, x_f_LB, N_cells_day_0_LB, N_cells_day_1_LB)
            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')

        elif method == 'poisson':

            df_diffusion_LB = _experimental_sampling_diffusion_Poisson(NreadsI, NreadsII, x_i_LB, x_f_LB, t, N_cells_day_0_LB, N_cells_day_1_LB)
            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')

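The samplers in gen_synthetic_data_Null condition each count on being observed (n>0) by drawing a uniform number on [P(n=0), 1) and passing it through the inverse CDF (the `ppf` calls above). The block below is a minimal, standard-library-only sketch of that idea for a Poisson distribution. The helper names `poisson_cdf_table` and `sample_zero_truncated_poisson` are hypothetical and are not part of NoisET, which relies on scipy.stats distributions instead; the sketch also assumes a moderate mean, so that exp(-mean) does not underflow.

```python
import bisect
import math
import random


def poisson_cdf_table(mean, tol=1e-12, max_k=10_000):
    """Hypothetical helper: cumulative probabilities P(N <= k) for k = 0, 1, ...

    The table is extended until the remaining tail mass drops below tol
    (or until max_k terms have been added).
    """
    pmf = math.exp(-mean)  # P(N = 0)
    cdf = [pmf]
    k = 0
    while 1.0 - cdf[-1] > tol and k < max_k:
        k += 1
        pmf *= mean / k  # recurrence P(N = k) = P(N = k-1) * mean / k
        cdf.append(cdf[-1] + pmf)
    return cdf


def sample_zero_truncated_poisson(mean, size, rng):
    """Hypothetical helper: draw `size` Poisson counts conditioned on n > 0."""
    cdf = poisson_cdf_table(mean)
    p0 = cdf[0]
    draws = []
    for _ in range(size):
        # Uniform number restricted to [P(N=0), 1): the mass of n = 0 is skipped.
        u = p0 + rng.random() * (1.0 - p0)
        # Inverse CDF: smallest k with P(N <= k) >= u.
        k = bisect.bisect_left(cdf, u)
        # Guard the two floating-point edge cases (u == p0, or u beyond the table).
        draws.append(min(max(k, 1), len(cdf) - 1))
    return draws


if __name__ == "__main__":
    rng = random.Random(0)
    draws = sample_zero_truncated_poisson(0.3, 2000, rng)
    print("smallest sampled count:", min(draws))
    print("sample mean:", sum(draws) / len(draws))
    print("theoretical mean:", 0.3 / (1.0 - math.exp(-0.3)))
```

Restricting the uniform draw to the part of the CDF above P(n=0) avoids rejection sampling, which would waste most draws when the mean count is small, as it is for rare clones.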
class longitudinal_analysis():

    """
    This class provides some tools to inspect and compute simple statistics on longitudinal data associated with
    one individual (it is independent of the NoisET software).

    ...
    Attributes
    ----------
    clone_count_label : str
        label in the clonotype tables indicating the clonotype count
    seq_label : str
        label in the clonotype tables indicating the sequence of the receptor
    clones : dict of pandas.DataFrame
        dictionary containing the clonotype tables as pandas frames. The keys are
        strings "patient_time"; replicates are merged. Created in the initialization
    times : list of float
        ordered times of the imported tables. Created in the initialization
    unique_clones : list of str
        list of all the unique clonotype sequences in all the time points
    time_occurrence : list of int
        number of time points in which each clonotype appears. The index
        refers to the clonotype in the unique_clones list
    Methods
    -------
    compute_clone_time_occurrence()
        It creates two new attributes: the list of unique clonotypes in all the dataset
        "self.unique_clones" and the time occurrence of each of them "self.time_occurrence".
        The time occurrence is the number of time points in which the clone appears.
    plot_hist_persistence(figsize=(12,10))
        It plots the distribution of time occurrence of the unique clonotypes
    top_clones_set(n_top_clones)
        Compute the set of top clones as the union of the "n_top_clones" most abundant
        clonotypes in each time point
    build_traj_frame(top_clones_set)
        Build the trajectory frame for the clonotypes in "top_clones_set"
    plot_trajectories(n_top_clones, colormap='viridis', figsize=(12,10))
        Function to plot the trajectories of the first "n_top_clones". Colors of the
        trajectories represent the cumulative frequency in all the time points.
732 PCA_traj(n_top_clones, nclus=4) 733 Perform PCA over the normalized trajectories of n_top_clones TCR clones. 734 The normalization consists in dividing the whole trajectory by its maximum value. 735 After PCA the trajectories are clustered in the two principal componets space 736 with a hierarchical clustering algorithm. 737 plot_PCA2(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)) 738 Plotting the trajectories in the space of their two principal components and 739 clustering them as in "PCA_traj". 740 plot_PCA_clusters_traj(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)) 741 Plotting the trajectories grouped by PCA clusters 742 """ 743 744 745 746 747 def __init__(self, patient, data_folder, sequence_label='N. Seq. CDR3', clone_count_label='Clone count', 748 replicate1_label='_F1', replicate2_label='_F2', separator='\t'): 749 """ 750 Import all the clonotypes of a given patient and store them in the dictionary "self.clones". 751 It also creates the list of times "self.times". During this process the replicates at the 752 same time points are merged together. 753 The names of the tables containing TCR should be structured as "patient_time_replicate.csv". 754 Those tables should be cvs files compressed in a zip archive (see the example notebook). 755 Parameters 756 ---------- 757 patient : str 758 The ID of the patient 759 data_folder : str 760 folder name containing the csv files listing the T-cell receptors 761 separator : str 762 separator symbol in the csv tables 763 """ 764 765 self.clone_count_label = clone_count_label 766 self.seq_label = sequence_label 767 self.unique_clones = None 768 self.time_occurrence = None 769 self.times = [] 770 clones_repl = dict() 771 772 # Iteration over all the file in the folder for importing each table 773 for file_name in os.listdir(data_folder): 774 # If the name before the underscore corresponds to the chosen patient.. 
775 if file_name.split('_')[0] == patient: 776 # Import the table 777 frame = pd.read_csv(data_folder+file_name, sep='\t', compression=dict(method='zip')) 778 # Store it in a dictionary where the key contains the patient, the time 779 # and the replicate. 780 clones_repl[file_name[:-10]] = frame 781 # Reading the time from the name and storing it 782 self.times.append(int(file_name.split('_')[1])) 783 print('Clonotypes',file_name[:-10],'imported') 784 785 # Sorting the unique times 786 self.times = np.sort(list(set(self.times))) 787 self.clones = self._merge_replicates(patient, clones_repl, replicate1_label, replicate2_label) 788 789 790 def _merge_replicates(self, patient, clones_repl, repl1_label, repl2_label): 791 792 clones_merged = dict() 793 794 # Iteration over the times 795 for it, t in enumerate(self.times): 796 # Building the ids correponding at 1st and 2nd replicate at given time point 797 id_F1 = patient + '_' + str(t) + repl1_label 798 id_F2 = patient + '_' + str(t) + repl2_label 799 # Below all the rows of one table are appended to the rows of the other 800 merged_replicates = clones_repl[id_F1].merge(clones_repl[id_F2], how='outer') 801 # But there are common clonotypes that now appear in two different rows 802 # (one for the first and one for the second replicate)! 
803 # Below we collapse those common sequences and the counts of the two are summed 804 merged_replicates = merged_replicates.groupby(self.seq_label, as_index=False).agg({self.clone_count_label:sum}) 805 depth = merged_replicates[self.clone_count_label].sum() 806 merged_replicates['Clone freq'] = merged_replicates[self.clone_count_label] / depth 807 merged_replicates = merged_replicates.sort_values('Clone freq', ascending=False) 808 # The merged table is then added to the dictionary 809 clones_merged[patient + '_' + str(t)] = merged_replicates 810 811 return clones_merged 812 813 814 def compute_clone_time_occurrence(self): 815 816 """ 817 It creates two new attribues: the list of uniqe clonotypes in all the dataset 818 "self.unique_clones" and the time occurrence of each of them "self.time_occurrence". 819 the time occurrence is the number of time points in which the clone appears. 820 """ 821 822 all_clones = np.array([]) 823 for id_, cl in self.clones.items(): 824 all_clones = np.append(all_clones, cl[self.seq_label].values) 825 826 # The following function returns the list of unique clonotypes and the number of 827 # repetitions for each of them. 
828 # Note that the number of repetitions is exactly the time occurrence 829 self.unique_clones, self.time_occurrence = np.unique(all_clones, return_counts=True) 830 831 832 def plot_hist_persistence(self, figsize=(12,10)): 833 834 """ 835 It plots the distribution of time occurrence of the unique clonotypes 836 Parameters 837 ---------- 838 figsize : tuple 839 width, height in inches 840 841 Returns 842 ------- 843 ax : matplotlib.axes._subplots.AxesSubplot 844 axes where to draw the plot 845 fig : matplotlib.figure.Figure 846 matplotlib figure 847 """ 848 849 if type(self.unique_clones) != np.ndarray: 850 self.compute_clone_time_occurrence() 851 852 fig, ax = plt.subplots(figsize=figsize) 853 854 plt.rc('xtick', labelsize = 30) 855 plt.rc('ytick', labelsize = 30) 856 857 ax.set_yscale('log') 858 ax.set_xlabel('Time occurrence', fontsize = 30) 859 ax.set_ylabel('Counts', fontsize = 30) 860 ax.hist(self.time_occurrence, bins=np.arange(1,len(self.times)+2)-0.5, rwidth=0.6) 861 862 return fig, ax 863 864 865 def top_clones_set(self, n_top_clones): 866 867 """ 868 Compute the set of top clones as the union of the "n_top_clones" most abundant 869 clonotype in each time point 870 Parameters 871 ---------- 872 n_top_clones : int 873 number of most abundant clontypes in each time point 874 Returns 875 ------- 876 top_clones : set of str 877 set of top clones 878 """ 879 880 top_clones = set() 881 for id_, cl in self.clones.items(): 882 top_clones_at_time = cl.sort_values(self.clone_count_label, ascending=False)[:n_top_clones] 883 top_clones = top_clones.union(top_clones_at_time[self.seq_label].values) 884 return top_clones 885 886 887 def build_traj_frame(self, clone_set): 888 889 """ 890 This builds a dataframe containing the frequency at all the time points for each 891 of the clonotypes specified in clone_set. 892 The dataframe has also a field that contains the cumulative frequency. 
893 Parameters 894 ---------- 895 clones_set : iterable of str 896 list of clonotypes whose temporal trajectory is drawn 897 Returns 898 ------- 899 traj_frame : pandas.DataFrame 900 dataframe containing the frequency at all the time points 901 """ 902 903 traj_frame = pd.DataFrame(index=clone_set) 904 traj_frame['Clone cumul freq'] = 0 905 906 for id_, cl in self.clones.items(): 907 908 # Getting the time from the index of clones_merged 909 t = id_.split('_')[1] 910 # Selecting the clonotypes that are both in the frame at the given time 911 # point and in the list of top_clones_set 912 top_clones_at_time = clone_set.intersection(set(cl[self.seq_label])) 913 # Creating a sub-dataframe containing only the clone in top_clones_at_time 914 clones_at_time = cl.set_index(self.seq_label).loc[top_clones_at_time] 915 # Creating a new column in the trajectory frames for the counts at that time 916 traj_frame['t'+str(t)] = traj_frame.index.map(clones_at_time['Clone freq'].to_dict()) 917 # The clonotypes not present at that time are NaN. Below we convert NaN in 0s 918 traj_frame = traj_frame.fillna(0) 919 # The cumulative count for each clonotype is updated 920 traj_frame['Clone cumul freq'] += traj_frame['t'+str(t)] 921 922 return traj_frame 923 924 925 926 # Plot clonal trajectories 927 928 929 def plot_trajectories(self, n_top_clones, colormap='viridis', figsize=(12,10)): 930 931 """ 932 Function to plot the trajectories of the first "n_top_clones". Colors of the 933 trajectories represent the cumulative frequency in all the time points. 
934 935 Parameters 936 ---------- 937 n_top_clones : int 938 number of most abundant clontypes in each time point 939 colormap : str 940 colors of the trajectories 941 942 figsize : tuple 943 width, height in inches 944 Returns 945 ------- 946 ax : matplotlib.axes._subplots.AxesSubplot 947 axes where to draw the plot 948 fig : matplotlib.figure.Figure 949 matplotlib figure 950 """ 951 952 cmap = cm.get_cmap(colormap) 953 top_clones = self.top_clones_set(n_top_clones) 954 traj_frame = self.build_traj_frame(top_clones) 955 956 fig, ax = plt.subplots(figsize=figsize) 957 plt.rc('xtick', labelsize = 30) 958 plt.rc('ytick', labelsize = 30) 959 ax.set_yscale('log') 960 ax.set_xlabel('time', fontsize = 25) 961 ax.set_ylabel('frequency', fontsize = 25) 962 963 log_counts = np.log10(traj_frame['Clone cumul freq'].values) 964 max_log_count = max(log_counts) 965 min_log_count = min(log_counts) 966 967 for id_, row in traj_frame.iterrows(): 968 traj = row.drop(['Clone cumul freq']).to_numpy() 969 log_count = np.log10(row['Clone cumul freq']) 970 norm_log_count = (log_count-min_log_count)/(max_log_count-min_log_count) 971 plt.plot(self.times, traj, c=cmap(norm_log_count)) 972 973 974 sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=min(log_counts), vmax=max(log_counts))) 975 cb = plt.colorbar(sm) 976 cb.set_label('Log10 cumulative frequency', fontsize = 25) 977 978 return fig, ax 979 980 981 def PCA_traj(self, n_top_clones, nclus=4): 982 983 """ 984 Perform PCA over the normalized trajectories of n_top_clones TCR clones. 985 The normalization consists in dividing the whole trajectory by its maximum value. 986 After PCA the trajectories are clustered in the two principal componets space 987 with a hierarchical clustering algorithm. 
988 989 Parameters 990 ---------- 991 n_top_clones : int 992 number of most abundant clontypes in each time point to consider in the PCA 993 nclus : float 994 number of clusters 995 996 Returns 997 ------- 998 pca : sklearn.decomposition._pca.PCA 999 object containing the result of the principal component analysis 1000 1001 clustering : sklearn.cluster._agglomerative.AgglomerativeClustering 1002 object containing the result of the hierarchical clustering 1003 """ 1004 1005 #Getting the top n_top_clones clonotypes at each time point 1006 top_clones = self.top_clones_set(n_top_clones) 1007 #Building a trajectory dataframe 1008 traj_frame = self.build_traj_frame(top_clones) 1009 1010 #Converting it in a numpy matrix 1011 traj_matrix = traj_frame.drop(['Clone cumul freq'], axis = 1).to_numpy() 1012 1013 # Normalize each trajectory by its maximum 1014 norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis] 1015 1016 pca = PCA(n_components =2).fit(norm_traj_matrix.T) 1017 clustering = AgglomerativeClustering(n_clusters = nclus) 1018 clustering = clustering.fit(pca.components_.T) 1019 1020 return pca, clustering 1021 1022 1023 def plot_PCA2(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)): 1024 1025 """ 1026 Plotting the trajectories in the space of their two principal components and 1027 clustering them as in "PCA_traj". 
1028 1029 Parameters 1030 ---------- 1031 n_top_clones : int 1032 number of most abundant clontypes in each time point to consider in the PCA 1033 nclus : float 1034 number of clusters 1035 colormap : str 1036 colormap indicating the different clusters 1037 figsize : tuple 1038 width, height in inches 1039 Returns 1040 ------- 1041 ax : matplotlib.axes._subplots.AxesSubplot 1042 axes where to draw the plot 1043 fig : matplotlib.figure.Figure 1044 matplotlib figure 1045 """ 1046 1047 1048 cmap = cm.get_cmap(colormap) 1049 pca, clustering = self.PCA_traj(n_top_clones, nclus) 1050 1051 fig, ax = plt.subplots(figsize=figsize) 1052 ax.set_title('PCA components (%i trajs)' %pca.n_features_, fontsize = 25) 1053 ax.set_xlabel('First component (expl var: %3.2f)'%pca.explained_variance_ratio_[0], fontsize = 25) 1054 ax.set_ylabel('Second component (expl var: %3.2f)'%pca.explained_variance_ratio_[1], fontsize = 25) 1055 for c_ind in range(clustering.n_clusters): 1056 x = pca.components_[0][clustering.labels_ == c_ind] 1057 y = pca.components_[1][clustering.labels_ == c_ind] 1058 ax.scatter(x, y, alpha=0.2, color=cmap(c_ind/clustering.n_clusters)) 1059 1060 return fig, ax 1061 1062 1063 def plot_PCA_clusters_traj(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)): 1064 1065 """ 1066 Plotting the trajectories grouped by PCA clusters 1067 1068 Parameters 1069 ---------- 1070 n_top_clones : int 1071 number of most abundant clontypes in each time point to consider in the PCA 1072 nclus : float 1073 number of clusters 1074 colormap : str 1075 colormap indicating the different clusters 1076 figsize : tuple 1077 width, height in inches 1078 Returns 1079 ------- 1080 axs : tuple of matplotlib.axes._subplots.AxesSubplot 1081 axis where to draw the plot 1082 fig : matplotlib.figure.Figure 1083 matplotlib figure 1084 """ 1085 1086 cmap = cm.get_cmap(colormap) 1087 pca, clustering = self.PCA_traj(n_top_clones, nclus) 1088 1089 n_cl = clustering.n_clusters 1090 1091 #Getting 
the top n_top_clones clonotypes at each time point 1092 top_clones = self.top_clones_set(n_top_clones) 1093 #Building a trajectory dataframe 1094 traj_frame = self.build_traj_frame(top_clones) 1095 1096 #Converting it in a numpy matrix 1097 traj_matrix = traj_frame.drop(['Clone cumul freq'], axis=1).to_numpy() 1098 1099 # Normalize each trajectory by its maximum 1100 norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis] 1101 1102 fig, axs = plt.subplots(2, n_cl, figsize=(5*n_cl, 12)) 1103 for cl in range(n_cl): 1104 trajs = norm_traj_matrix[clustering.labels_ == cl] 1105 axs[0][cl].set_xlabel('Time', fontsize = 15) 1106 axs[0][cl].set_ylabel('Normalized frequency', fontsize = 15) 1107 axs[1][cl].set_xlabel('Time', fontsize = 15) 1108 axs[1][cl].set_ylabel('Normalized frequency', fontsize = 15) 1109 for traj in trajs: 1110 axs[0][cl].plot(self.times, traj, alpha=0.2, color=cmap(cl/n_cl)) 1111 axs[1][cl].set_ylim(0,1) 1112 axs[1][cl].errorbar(self.times, np.mean(trajs, axis=0), 1113 yerr=np.std(trajs, axis=0), lw=3, color=cmap(cl/n_cl)) 1114 #axs[1][cl].fill_between(times, np.quantile(trajs, 0.75, axis=0), np.quantile(trajs, 0.25, axis=0), color=colors[cl]) 1115 1116 plt.tight_layout() 1117 return fig, axs
This class provides tools to inspect and compute simple statistics on longitudinal data associated with one individual (it is independent of the NoisET software).
...
Attributes
- clone_count_label (str): label in the clonotype tables indicating the clonotype count
- seq_label (str): label in the clonotype tables indicating the sequence of the receptor
- clones (dict of pandas.DataFrame): dictionary containing the clonotype tables as pandas frames. The keys are strings "patient_time"; replicates are merged. Created in the initialization
- times (list of float): ordered times of the imported tables. Created in the initialization
- unique_clones (list of str): list of all the unique clonotype sequences in all the time points
- time_occurrence (list of int): number of time points in which each clonotype appears. The index refers to the clonotype in the unique_clones list
Methods
- compute_clone_time_occurrence(): creates two new attributes, the list of unique clonotypes in the whole dataset "self.unique_clones" and the time occurrence of each of them "self.time_occurrence". The time occurrence is the number of time points in which the clone appears.
- plot_hist_persistence(figsize=(12,10)): plots the distribution of the time occurrence of the unique clonotypes.
- top_clones_set(n_top_clones): computes the set of top clones as the union of the "n_top_clones" most abundant clonotypes in each time point.
- build_traj_frame(clone_set): builds a dataframe containing the frequency at all the time points for each of the clonotypes in clone_set, together with their cumulative frequency.
- plot_trajectories(n_top_clones, colormap='viridis', figsize=(12,10)): plots the trajectories of the top clones. Colors of the trajectories represent the cumulative frequency in all the time points.
- PCA_traj(n_top_clones, nclus=4): performs PCA over the normalized trajectories of the top TCR clones. The normalization consists in dividing the whole trajectory by its maximum value. After PCA the trajectories are clustered in the space of the two principal components with a hierarchical clustering algorithm.
- plot_PCA2(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)): plots the trajectories in the space of their two principal components and clusters them as in "PCA_traj".
- plot_PCA_clusters_traj(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)): plots the trajectories grouped by PCA clusters.
def __init__(self, patient, data_folder, sequence_label='N. Seq. CDR3', clone_count_label='Clone count', replicate1_label='_F1', replicate2_label='_F2', separator='\t'):
Import all the clonotypes of a given patient and store them in the dictionary "self.clones". It also creates the list of times "self.times". During this process the replicates at the same time point are merged together. The names of the tables containing TCR should be structured as "patient_time_replicate.csv". Those tables should be csv files compressed in a zip archive (see the example notebook).
Parameters
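- patient (str): the ID of the patient
- data_folder (str): folder name containing the csv files listing the T-cell receptors
- sequence_label (str): label of the column containing the receptor sequence (default 'N. Seq. CDR3')
- clone_count_label (str): label of the column containing the clonotype count (default 'Clone count')
- replicate1_label (str): suffix identifying the first replicate in the table names (default '_F1')
- replicate2_label (str): suffix identifying the second replicate in the table names (default '_F2')
- separator (str): separator symbol in the csv tables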
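The replicate merge performed during initialization can be sketched without pandas. The snippet below is only a plain-Python stand-in for the `merge`/`groupby` calls in the source listing. It assumes each replicate is a dict mapping a receptor sequence to its count. The function name `merge_replicates` and the sequences are made up for illustration. Counts of clonotypes shared by both replicates are summed, and frequencies are obtained by dividing by the total depth.

```python
from collections import Counter


def merge_replicates(rep1, rep2):
    """Sum the counts of two replicates and return (counts, frequencies)."""
    merged = Counter(rep1)
    merged.update(rep2)  # adds the counts of sequences present in both replicates
    depth = sum(merged.values())  # total number of counts after merging
    freqs = {seq: n / depth for seq, n in merged.items()}
    return dict(merged), freqs


# Hypothetical replicates taken at the same time point
rep_f1 = {"CASSLGTDTQYF": 6, "CASSPGQGNEQFF": 3}
rep_f2 = {"CASSLGTDTQYF": 4, "CASRDRGYEQYF": 7}

counts, freqs = merge_replicates(rep_f1, rep_f2)
# Print the merged table sorted by decreasing frequency
for seq in sorted(freqs, key=freqs.get, reverse=True):
    print(seq, counts[seq], round(freqs[seq], 3))
```

After the merge each sequence appears exactly once per time point, which is what the later per-time-point statistics rely on.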
def compute_clone_time_occurrence(self):
It creates two new attributes: the list of unique clonotypes in the whole dataset "self.unique_clones" and the time occurrence of each of them "self.time_occurrence". The time occurrence is the number of time points in which the clone appears.
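Because each merged table lists a clonotype at most once, counting how often a sequence appears across all tables gives its time occurrence directly. The sketch below illustrates this with `collections.Counter` as a plain-Python stand-in for the `np.unique(..., return_counts=True)` call in the source listing. The dictionary `clones_by_time` and its sequences are hypothetical.

```python
from collections import Counter

# Hypothetical data: time label -> set of clonotype sequences seen at that time
clones_by_time = {
    "0": {"AAA", "CCC", "GGG"},
    "15": {"AAA", "GGG"},
    "30": {"AAA", "TTT"},
}

occurrence = Counter()
for sequences in clones_by_time.values():
    occurrence.update(sequences)  # each sequence is counted once per time point

# Sorted unique sequences and, at the same index, their time occurrence
unique_clones = sorted(occurrence)
time_occurrence = [occurrence[seq] for seq in unique_clones]

print(unique_clones)
print(time_occurrence)
```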
def plot_hist_persistence(self, figsize=(12,10)):
It plots the distribution of the time occurrence of the unique clonotypes. The time occurrence is computed first if it is not yet available.
Parameters
- figsize (tuple): width, height in inches
Returns
- fig (matplotlib.figure.Figure): matplotlib figure
- ax (matplotlib.axes._subplots.AxesSubplot): axes on which the plot is drawn
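In the source listing the histogram bins are built as `np.arange(1, len(self.times)+2) - 0.5`, which places the bin edges at half-integers so that every integer occurrence falls in the middle of its own bin. A minimal plain-Python sketch of that binning is shown below. The values in `time_occurrence` and the number of time points are invented for the example.

```python
n_times = 4  # hypothetical number of sampled time points
time_occurrence = [1, 1, 2, 4, 1, 3, 4]  # hypothetical occurrences, one per clonotype

# Edges 0.5, 1.5, ..., n_times + 0.5: one bin centered on each integer 1..n_times
edges = [k - 0.5 for k in range(1, n_times + 2)]

# Count how many clonotypes fall in each bin
counts = [
    sum(low <= value < high for value in time_occurrence)
    for low, high in zip(edges, edges[1:])
]

print(edges)
print(counts)
```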
def top_clones_set(self, n_top_clones):
Compute the set of top clones as the union of the "n_top_clones" most abundant clonotypes in each time point.
Parameters
- n_top_clones (int): number of most abundant clonotypes in each time point
Returns
- top_clones (set of str): set of top clones
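The union over time points means the returned set can hold more than n_top_clones sequences, since different clonotypes may dominate at different times. The following sketch illustrates the idea in plain Python. The function name `top_clones_union`, the dict-based tables, and the counts are assumptions made for the example, not part of the package.

```python
def top_clones_union(counts_by_time, n_top):
    """Union of the n_top most abundant sequences at each time point."""
    top = set()
    for counts in counts_by_time.values():
        # Rank the sequences of this time point by decreasing count
        ranked = sorted(counts, key=counts.get, reverse=True)
        top.update(ranked[:n_top])
    return top


# Hypothetical count tables, one per time point
counts_by_time = {
    "0": {"AAA": 50, "CCC": 30, "GGG": 5},
    "15": {"AAA": 10, "GGG": 40, "TTT": 2},
}

top = top_clones_union(counts_by_time, n_top=2)
print(sorted(top))
```

Here two clones are requested per time point, but the union contains three sequences.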
def build_traj_frame(self, clone_set):
This builds a dataframe containing the frequency at all the time points for each of the clonotypes specified in clone_set. Clonotypes absent at a time point get frequency 0. The dataframe also has a field that contains the cumulative frequency.
Parameters
- clone_set (set of str): set of clonotypes whose temporal trajectory is drawn
Returns
- traj_frame (pandas.DataFrame): dataframe containing the frequency at all the time points
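The structure of the trajectory table can be sketched with plain dictionaries. This is an illustrative stand-in, not the pandas implementation. The function name `build_trajectories` and the frequencies are hypothetical. Each selected clonotype gets one frequency per time point, 0 when it is absent, and the sum of those frequencies as its cumulative frequency.

```python
def build_trajectories(freqs_by_time, clone_set):
    """Return the ordered time labels and, per clone, its trajectory and cumulative frequency."""
    times = list(freqs_by_time)
    table = {}
    for seq in clone_set:
        # A clonotype that is missing at a time point gets frequency 0
        traj = [freqs_by_time[t].get(seq, 0.0) for t in times]
        table[seq] = {"traj": traj, "cumul": sum(traj)}
    return times, table


# Hypothetical frequency tables, one per time point
freqs_by_time = {
    "0": {"AAA": 0.5, "CCC": 0.25},
    "15": {"AAA": 0.25, "GGG": 0.5},
}

times, table = build_trajectories(freqs_by_time, {"AAA", "GGG"})
for seq in sorted(table):
    print(seq, table[seq]["traj"], table[seq]["cumul"])
```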
def plot_trajectories(self, n_top_clones, colormap='viridis', figsize=(12,10)):
Plot the frequency trajectories of the top clones, that is the union of the "n_top_clones" most abundant clonotypes in each time point. The color of each trajectory represents its log10 cumulative frequency over all the time points.
Parameters
- n_top_clones (int): number of most abundant clonotypes in each time point
- colormap (str): name of the colormap used to color the trajectories
- figsize (tuple): width, height in inches
Returns
- fig (matplotlib.figure.Figure): matplotlib figure
- ax (matplotlib.axes._subplots.AxesSubplot): axes on which the plot is drawn
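The trajectory color is obtained by rescaling the log10 cumulative frequency to the interval [0, 1] before passing it to the colormap. A minimal sketch of that rescaling follows. The function name `normalized_log_values` is hypothetical, and the guard for identical values is an addition of this example (the source listing divides by the range directly).

```python
import math


def normalized_log_values(cumul_freqs):
    """Map positive cumulative frequencies to [0, 1] on a log10 scale."""
    logs = [math.log10(f) for f in cumul_freqs]
    low, high = min(logs), max(logs)
    if high == low:
        # Assumption of this sketch: avoid dividing by zero when all values coincide
        return [0.0 for _ in logs]
    return [(value - low) / (high - low) for value in logs]


# Hypothetical cumulative frequencies spanning two decades
values = normalized_log_values([0.001, 0.01, 0.1])
print([round(v, 3) for v in values])
```

The smallest cumulative frequency maps to 0 and the largest to 1, so the whole range of the colormap is used.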
def PCA_traj(self, n_top_clones, nclus=4):
Perform PCA over the normalized trajectories of the top TCR clones (the union of the "n_top_clones" most abundant clonotypes in each time point). The normalization consists in dividing the whole trajectory by its maximum value. After PCA the trajectories are clustered in the space of the two principal components with a hierarchical clustering algorithm.
Parameters
- n_top_clones (int): number of most abundant clonotypes in each time point to consider in the PCA
- nclus (int): number of clusters
Returns
- pca (sklearn.decomposition._pca.PCA): object containing the result of the principal component analysis
- clustering (sklearn.cluster._agglomerative.AgglomerativeClustering): object containing the result of the hierarchical clustering
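The normalization step puts every trajectory on a common scale, so that the PCA compares trajectory shapes rather than absolute abundances. The sketch below covers only that step, in plain Python. The function name `normalize_by_max` and the frequencies are hypothetical. The PCA and the clustering themselves are done with scikit-learn in the source listing and are not reproduced here.

```python
def normalize_by_max(trajectories):
    """Divide each trajectory by its own maximum, so that its peak equals 1."""
    return [[value / max(traj) for value in traj] for traj in trajectories]


# Hypothetical frequency trajectories (one row per clonotype, one column per time point)
trajectories = [
    [0.02, 0.04, 0.08],  # expanding clone
    [0.30, 0.30, 0.00],  # clone that disappears at the last time point
]

normalized = normalize_by_max(trajectories)
for row in normalized:
    print([round(v, 3) for v in row])
```

Every top clone is present in at least one time point, so the maximum of its trajectory is positive and the division is well defined.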
def plot_PCA2(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):
Plot the trajectories in the space of their two principal components and cluster them as in "PCA_traj".
Parameters
- n_top_clones (int): number of most abundant clonotypes in each time point to consider in the PCA
- nclus (int): number of clusters
- colormap (str): colormap indicating the different clusters
- figsize (tuple): width, height in inches
Returns
- fig (matplotlib.figure.Figure): matplotlib figure
- ax (matplotlib.axes._subplots.AxesSubplot): axes on which the plot is drawn
def plot_PCA_clusters_traj(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):
Plotting the trajectories grouped by PCA clusters
Parameters
- n_top_clones (int): number of most abundant clonotypes in each time point to consider in the PCA
- nclus (int): number of clusters
- colormap (str): colormap indicating the different clusters
- figsize (tuple): width, height in inches
Returns
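- fig (matplotlib.figure.Figure): matplotlib figure
- axs (numpy array of matplotlib.axes.Axes): axes of the 2 x n_clusters grid of panels (top row: individual normalized trajectories; bottom row: cluster mean with standard-deviation error bars). The method returns the pair in the order (fig, axs).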
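The normalization and per-cluster summary used by this plot can be illustrated on a toy example. This is a minimal sketch, not the package's code path: the trajectory matrix and the cluster labels below are made up, and `labels` merely stands in for the labels produced by the clustering step.

```python
import numpy as np

# Toy trajectory matrix: one row per clonotype, one column per time point (made-up frequencies).
traj_matrix = np.array([
    [0.001, 0.004, 0.002],
    [0.010, 0.005, 0.0025],
    [0.002, 0.008, 0.004],
])
# Hypothetical cluster labels, standing in for the output of the clustering step.
labels = np.array([0, 1, 0])

# Divide each row by its own maximum so that every trajectory peaks at 1.
norm_traj_matrix = traj_matrix / np.max(traj_matrix, axis=1)[:, np.newaxis]

# Mean and standard deviation of the normalized trajectories in each cluster, per time point.
cluster_mean = {int(cl): norm_traj_matrix[labels == cl].mean(axis=0) for cl in np.unique(labels)}
cluster_std = {int(cl): norm_traj_matrix[labels == cl].std(axis=0) for cl in np.unique(labels)}
```

Normalizing by the row maximum lets clones of very different abundance be compared by the shape of their dynamics, which is what the bottom-row error-bar panels summarize.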
class Data_Process():

    """
    A class used to represent longitudinal RepSeq data and pre-analysis of the longitudinal data associated with
    one individual.
    ...
    Attributes
    ----------
    path : str
        path to the directory containing the data files used for the analysis
    filename1 : str
        the name of the file of the RepSeq sample which can be the first replicate when deciphering the experimental noise
        or the first time point RepSeq sample when analysing responding clones to a stimulus between two time points.
    filename2 : str
        the name of the file of the RepSeq sample which can be the second replicate when deciphering the experimental noise
        or the second time point RepSeq sample when analysing responding clones to a stimulus between two time points.
    colnames1 : list of str
        list of column names of the data set - first sample
    colnames2 : list of str
        list of column names of the data set - second sample
    Methods
    -------
    import_data() :
        imports and merges two RepSeq samples and builds a single data-frame with frequencies and abundances of all TCR clones present in the
        union of both samples.

    """

    def __init__(self, path, filename1, filename2, colnames1, colnames2):

        self.path = path
        self.filename1 = filename1
        self.filename2 = filename2
        self.colnames1 = colnames1
        self.colnames2 = colnames2


    def import_data(self):
        """
        Imports and merges two RepSeq samples and builds a single data-frame with frequencies and abundances of all TCR clones present in the union of both samples.

        Parameters
        ----------
        None
        Returns
        -------
        number_clones
            int, number of clones in the merged data-frame (union of the two RepSeq samples), counted before the count filters are applied
        df
            pandas data-frame containing the information labeled by the colnames lists
            for both RepSeq samples taken as input.
        """

        mincount = 0
        maxcount = np.inf

        headerline=0 #line number of headerline
        newnames=['Clone_fraction','Clone_count','ntCDR3','AACDR3']

        if self.filename1[-2:] == 'gz':
            F1Frame_chunk=pd.read_csv(self.path + self.filename1, delimiter='\t',usecols=self.colnames1,header=headerline, compression = 'gzip')[self.colnames1]
        else:
            F1Frame_chunk=pd.read_csv(self.path + self.filename1, delimiter='\t',usecols=self.colnames1,header=headerline)[self.colnames1]

        if self.filename2[-2:] == 'gz':
            F2Frame_chunk=pd.read_csv(self.path + self.filename2, delimiter='\t',usecols=self.colnames2,header=headerline, compression = 'gzip')[self.colnames2]
        else:
            F2Frame_chunk=pd.read_csv(self.path + self.filename2, delimiter='\t',usecols=self.colnames2,header=headerline)[self.colnames2]

        F1Frame_chunk.columns=newnames
        F2Frame_chunk.columns=newnames
        suffixes=('_1','_2')
        #outer merge on the nucleotide CDR3: keeps clones seen in either sample
        mergedFrame=pd.merge(F1Frame_chunk,F2Frame_chunk,on=newnames[2],suffixes=suffixes,how='outer')
        for nameit in [0,1]:
            for labelit in suffixes:
                col=newnames[nameit]+labelit
                #assign the results back: fillna/astype called on a .loc selection are not guaranteed to modify mergedFrame
                mergedFrame[col]=mergedFrame[col].fillna(0)
                if nameit==1:
                    mergedFrame[col]=mergedFrame[col].astype(int)
        def dummy(x):
            val=x.iloc[0]
            if pd.isnull(val):
                val=x.iloc[1]
            return val
        mergedFrame.loc[:,newnames[3]+suffixes[0]]=mergedFrame.loc[:,[newnames[3]+suffixes[0],newnames[3]+suffixes[1]]].apply(dummy,axis=1) #assigns AA sequence to clones, creates duplicates
        mergedFrame.drop(columns=newnames[3]+suffixes[1],inplace=True) #removes duplicates
        mergedFrame.rename(columns = {newnames[3]+suffixes[0]:newnames[3]}, inplace = True)
        mergedFrame=mergedFrame[[newname+suffix for newname in newnames[:2] for suffix in suffixes]+[newnames[2],newnames[3]]]
        filterout=((mergedFrame.Clone_count_1<mincount) & (mergedFrame.Clone_count_2==0)) | ((mergedFrame.Clone_count_2<mincount) & (mergedFrame.Clone_count_1==0)) #has effect only if mincount>0
        number_clones=len(mergedFrame)
        return number_clones,mergedFrame.loc[((mergedFrame.Clone_count_1<=maxcount) & (mergedFrame.Clone_count_2<=maxcount)) & ~filterout]
A class used to represent longitudinal RepSeq data and pre-analysis of the longitudinal data associated with one individual. ...
Attributes
- path (str): the name of the path to get access to the data files to use for our analysis
- filename1 (str): the name of the file of the RepSeq sample which can be the first replicate when deciphering the experimental noise or the first time point RepSeq sample when analysing responding clones to a stimulus between two time points.
- filename2 (str): the name of the file of the RepSeq sample which can be the second replicate when deciphering the experimental noise or the second time point RepSeq sample when analysing responding clones to a stimulus between two time points.
- colnames1 (list of str): list of column names of the data set, first sample
- colnames2 (list of str): list of column names of the data set, second sample
Methods
import_data() : imports and merges two RepSeq samples and builds a single data-frame with the frequencies and abundances of all TCR clones present in the union of both samples.
def import_data(self):
Imports and merges two RepSeq samples and builds a single data-frame with the frequencies and abundances of all TCR clones present in the union of both samples.
Parameters
- None
Returns
- number_clones: int, number of clones in the merged data-frame, i.e. the union of the two RepSeq samples given as input
- df: pandas data-frame containing the information labeled by the colnames lists for both RepSeq samples taken as input.
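The merge performed by import_data() can be sketched on two small in-memory tables. This is an illustration under stated assumptions rather than the package's implementation: the sequences and counts are made up, the tables are built directly instead of being read from files, and the column names are the standardized ones the method assigns internally.

```python
import pandas as pd

# Two made-up samples sharing one clone ("TGTAGC"); real data would be read from files.
sample1 = pd.DataFrame({
    "Clone_fraction": [0.5, 0.5],
    "Clone_count": [10, 10],
    "ntCDR3": ["TGTGCC", "TGTAGC"],
    "AACDR3": ["CA", "CS"],
})
sample2 = pd.DataFrame({
    "Clone_fraction": [0.25, 0.75],
    "Clone_count": [5, 15],
    "ntCDR3": ["TGTAGC", "TGTGGG"],
    "AACDR3": ["CS", "CG"],
})

# Outer merge on the nucleotide sequence keeps clones seen in either sample.
merged = pd.merge(sample1, sample2, on="ntCDR3", suffixes=("_1", "_2"), how="outer")

# A clone absent from one sample gets frequency 0 and count 0 in that sample.
for col in ["Clone_fraction_1", "Clone_fraction_2", "Clone_count_1", "Clone_count_2"]:
    merged[col] = merged[col].fillna(0)
for col in ["Clone_count_1", "Clone_count_2"]:
    merged[col] = merged[col].astype(int)

# Keep a single amino-acid column, taking the value from whichever sample has it.
merged["AACDR3"] = merged["AACDR3_1"].fillna(merged["AACDR3_2"])
merged = merged.drop(columns=["AACDR3_1", "AACDR3_2"])

number_clones = len(merged)
```

The resulting table has one row per clone of the union, with `_1` and `_2` columns for the two samples, which is the layout expected by the noise-learning step.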
class Noise_Model():

    """
    A class used to build an object associated to methods in order to learn the experimental noise from same day
    biological RepSeq samples.
    ...
    Methods
    -------
    get_sparserep(df) :
        get sparse representation of the abundances / frequencies of the TCR clones present in both RepSeq samples of interest.
        this changes the data input to speed up the algorithm
    learn_null_model(df, noise_model, init_paras, output_dir = None, filename = None, display_loss_function = False) :
        function to optimize the likelihood associated to the experimental noise model and get the associated parameters.
    diversity_estimate(df, paras, noise_model) :
        function to get the estimation of diversity from the noise model information.
    """


    def get_sparserep(self, df):
        """
        Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
        unicountvals_1(2) are the unique values of n1(2).
        sparse_rep_counts gives the counts of unique pairs.
        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
        len(indn1)=len(indn2)=len(sparse_rep_counts)
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            this data-frame should give the list of TCR clones present in two replicate RepSeq samples
            together with their clone frequencies and clone abundances in the first and second replicate.
        Returns
        -------
        indn1
            numpy array, for each unique pair, index into unicountvals_1
        indn2
            numpy array, for each unique pair, index into unicountvals_2
        sparse_rep_counts
            numpy array, # of clones having the read counts pair {(n1,n2)}
        unicountvals_1
            numpy array list of unique counts values present in the first sample in df['Clone_count_1']
        unicountvals_2
            numpy array list of unique counts values present in the second sample in df['Clone_count_2']
        Nreads1
            float, total number of counts/reads in the first sample referred in df by "_1"
        Nreads2
            float, total number of counts/reads in the second sample referred in df by "_2"
        """

        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']]
        counts['paircount'] = 1 # gives a weight of 1 to each observed clone

        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
        clonecountpair_vals = clone_counts.index.values
        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
        NreadsI = np.sum(counts['Clone_count_1'])
        NreadsII = np.sum(counts['Clone_count_2'])

        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)

        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII



    def _NegBinPar(self,m,v,mvec):
        '''
        Same as NegBinParMtr, but for m and v being scalars.
        Assumes m>0.
        Output is (len(mvec),) array
        '''
        mmax=mvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(mmax+1,dtype=float)
        NBvec[1:]=np.log((NBvec[1:]+r-1)/NBvec[1:]*p) #vectorization won't help unfortunately here since log needs to be over array
        NBvec[0]=r*math.log(m/v)
        NBvec=np.exp(np.cumsum(NBvec)[mvec]) #save a bit here
        return NBvec

    def _NegBinParMtr(self,m,v,nvec): #speed up only insofar as the log and exp are called once on array instead of multiple times on rows
        '''
        computes NegBin probabilities over the ordered (but possibly discontiguous) vector (nvec)
        for mean/variance combinations given by the mean (m) and variance (v) vectors.
        Note that m<v for negative binomial.
        Output is (len(m),len(nvec)) array
        '''
        nmax=nvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(nmax+1,dtype=float)
        NBvec=np.log((NBvec+r[:,np.newaxis]-1)*(p[:,np.newaxis]/NBvec))
        NBvec[:,0]=r*np.log(m/v) #handle NBvec[0]=0, treated specially when m[0]=0, see below
        NBvec=np.exp(np.cumsum(NBvec,axis=1)) #save a bit here
        if m[0]==0:
            NBvec[0,:]=0.
            NBvec[0,0]=1.
        NBvec=NBvec[:,nvec]
        return NBvec

    def _PoisPar(self, Mvec,unicountvals):
        #assert Mvec[0]==0, "first element needs to be zero"
        nmax=unicountvals[-1]
        nlen=len(unicountvals)
        mlen=len(Mvec)
        Nvec=unicountvals
        logNvec=-np.insert(np.cumsum(np.log(np.arange(1,nmax+1))),0,0.)[unicountvals] #avoid n=0 nans
        Nmtr=np.exp(Nvec[np.newaxis,:]*np.log(Mvec)[:,np.newaxis]+logNvec[np.newaxis,:]-Mvec[:,np.newaxis]) # np.log(Mvec) throws warning: since log(0)=-inf
        if Mvec[0]==0:
            Nmtr[0,:]=np.zeros((nlen,)) #when m=0, n=0, and so get rid of nans from log(0)
            Nmtr[0,0]=1. #handled below
        if unicountvals[0]==0: #if n=0 included get rid of nans from log(0)
            Nmtr[:,0]=np.exp(-Mvec)
        return Nmtr

    def _get_rhof(self,alpha_rho, nfbins,fmin,freq_dtype):
        '''
        generates power law (power is alpha_rho) clone frequency distribution over
        nfbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax), nfbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst
        return logrhovec,logfvec, normconst


    def _get_logPn_f(self,unicounts,Nreads,logfvec, noise_model, paras):

        """
        tools to compute the likelihood of the noise model. It is not useful for the user.
        """

        # Choice of the model:

        if noise_model<1:

            m_total=float(np.power(10, paras[3]))
            r_c=Nreads/m_total
        if noise_model<2:

            beta_mv= paras[1]
            alpha_mv=paras[2]

        if noise_model<1: #for models that include cell counts
            #compute parametrized range (mean-sigma,mean+5*sigma) of m values (number of cells) conditioned on n values (reads) appearing in the data only
            nsigma=5.
            nmin=300.
            #for each n, get actual range of m to compute around n-dependent mean m
            m_low =np.zeros((len(unicounts),),dtype=int)
            m_high=np.zeros((len(unicounts),),dtype=int)
            for nit,n in enumerate(unicounts):
                mean_m=n/r_c
                dev=nsigma*np.sqrt(mean_m)
                m_low[nit] =int(mean_m- dev) if (mean_m>dev**2) else 0
                m_high[nit]=int(mean_m+5*dev) if ( n>nmin) else int(10*nmin/r_c)
            m_cellmax=np.max(m_high)
            #across n, collect all in-range m
            mvec_bool=np.zeros((m_cellmax+1,),dtype=bool) #cheap bool
            nvec=range(len(unicounts))
            for nit in nvec:
                mvec_bool[m_low[nit]:m_high[nit]+1]=True #mask vector
            mvec=np.arange(m_cellmax+1)[mvec_bool]
            #transform to in-range index
            for nit in nvec:
                m_low[nit]=np.where(m_low[nit]==mvec)[0][0]
                m_high[nit]=np.where(m_high[nit]==mvec)[0][0]

        Pn_f=np.zeros((len(logfvec),len(unicounts)))
        if noise_model==0:

            mean_m=m_total*np.exp(logfvec)
            var_m=mean_m+beta_mv*np.power(mean_m,alpha_mv)
            Poisvec = self._PoisPar(mvec*r_c,unicounts)
            for f_it in range(len(logfvec)):
                NBvec=self._NegBinPar(mean_m[f_it],var_m[f_it],mvec)
                for n_it,n in enumerate(unicounts):
                    Pn_f[f_it,n_it]=np.dot(NBvec[m_low[n_it]:m_high[n_it]+1],Poisvec[m_low[n_it]:m_high[n_it]+1,n_it])

        elif noise_model==1:

            mean_n=Nreads*np.exp(logfvec)
            var_n=mean_n+beta_mv*np.power(mean_n,alpha_mv)
            Pn_f = self._NegBinParMtr(mean_n,var_n,unicounts)
        elif noise_model==2:

            mean_n=Nreads*np.exp(logfvec)
            Pn_f= self._PoisPar(mean_n,unicounts)
        else:
            #fail loudly instead of printing a message and returning log(0) everywhere
            raise ValueError('noise_model is 0, 1, or 2 only')

        return np.log(Pn_f)

    #-----------------------------Null-Model-optimization--------------------------

    def _get_Pn1n2(self, paras, sparse_rep, noise_model):

        """
        Tool to compute likelihood of the noise model. It is not useful for the user.
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = sparse_rep

        nfbins = 1200
        freq_dtype = float

        # Parameters

        alpha = paras[0]
        fmin = np.power(10,paras[-1])

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)

        logfvec_tmp=deepcopy(logfvec)

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)

        # for the trapezoid integral methods

        dlogfby2=np.diff(logfvec)/2

        # Compute P(0,0) for the normalization constraint
        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])

        #print("computing P(n1,n2)")
        Pn1n2 = np.zeros(len(sparse_rep_counts)) # 1D representation
        for it, (ind1, ind2) in enumerate(zip(indn1, indn2)):
            integ = np.exp(logPn1_f[:, ind1] + logrhofvec + logPn2_f[:, ind2] + logfvec)
            Pn1n2[it] = np.dot(dlogfby2, integ[1:] + integ[:-1])
        Pn1n2 /= 1. - Pn0n0 # renormalize
        # average negative log-likelihood per observed clone
        return -np.dot(sparse_rep_counts, np.where(Pn1n2 > 0, np.log(Pn1n2), 0)) / float(np.sum(sparse_rep_counts))



    def _callback(self, paras, nparas, sparse_rep, noise_model):
        '''prints iteration info. called by scipy.optimize.minimize. Not useful for the user.'''

        global curr_iter
        #curr_iter = 0
        global Loss_function
        print(''.join(['{0:d} ']+['{'+str(it)+':3.6f} ' for it in range(1,len(paras)+1)]).format(*([curr_iter]+list(paras))))
        #print ('{' + str(len(paras)+1) + ':3.6f}'.format( [self.get_Pn1n2(paras, sparse_rep, acq_model_type)]))
        Loss_function = self._get_Pn1n2(paras, sparse_rep, noise_model)
        print(Loss_function)
        curr_iter += 1



    # Constraints for the Null-Model, no filtered
    def _nullmodel_constr_fn(self, paras, sparse_rep, noise_model, constr_type):

        '''
        returns either or both of the two level-set functions: log<f>-log(1/N), with N=Nclones/(1-P(0,0)) and log(Z_f), with Z_f=N<f>_{n+n'=0} + sum_i^Nclones <f>_{f|n,n'}
        not useful for the user
        '''

        # Choice of the model:

        indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep

        #Variables that would be chosen in the future by the user
        nfbins = 1200
        freq_dtype = float

        alpha = paras[0] # power law exponent
        fmin = np.power(10, paras[-1]) # true minimal frequency

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)
        dlogfby2 = np.diff(logfvec) / 2. # 1/2 comes from trapezoid integration below

        integ = np.exp(logrhofvec + 2 * logfvec)
        avgf_ps = np.dot(dlogfby2, integ[:-1] + integ[1:])

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI, logfvec, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII, logfvec, noise_model, paras)

        integ = np.exp(logPn1_f[:, 0] + logPn2_f[:, 0] + logrhofvec + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])
        logPnng0 = np.log(1 - Pn0n0)
        avgf_null_pair = np.exp(logPnng0 - np.log(np.sum(sparse_rep_counts)))

        C1 = np.log(avgf_ps) - np.log(avgf_null_pair)

        integ = np.exp(logPn1_f[:, 0] + logPn2_f[:, 0] + logrhofvec + 2 * logfvec)
        log_avgf_n0n0 = np.log(np.dot(dlogfby2, integ[1:] + integ[:-1]))

        integ = np.exp(logPn1_f[:, indn1] + logPn2_f[:, indn2] + logrhofvec[:, np.newaxis] + logfvec[:, np.newaxis])
        log_Pn1n2 = np.log(np.sum(dlogfby2[:, np.newaxis] * (integ[1:, :] + integ[:-1, :]), axis=0))
        integ = np.exp(np.log(integ) + logfvec[:, np.newaxis])
        tmp = deepcopy(log_Pn1n2)
        tmp[tmp == -np.inf] = np.inf # since subtracted in next line (np.inf: the np.Inf alias is gone in NumPy 2)
        avgf_n1n2 = np.exp(np.log(np.sum(dlogfby2[:, np.newaxis] * (integ[1:, :] + integ[:-1, :]), axis=0)) - tmp)
        log_sumavgf = np.log(np.dot(sparse_rep_counts, avgf_n1n2))

        logNclones = np.log(np.sum(sparse_rep_counts)) - logPnng0
        Z = np.exp(logNclones + np.log(Pn0n0) + log_avgf_n0n0) + np.exp(log_sumavgf)

        C2 = np.log(Z)


        # print('C1:'+str(C1)+' C2:'+str(C2))
        if constr_type == 0:
            return C1
        elif constr_type == 1:
            return C2
        else:
            return C1, C2



    # Null-Model optimization learning

    def learn_null_model(self, df, noise_model, init_paras, output_dir = None, filename = None, display_loss_function = False): # constraint type 1 gives only low error modes, see paper for details.
        """
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            this data-frame should give the list of TCR clones present in two replicate RepSeq samples
            together with their clone frequencies and clone abundances in the first and second replicate.
        noise_model: int
            choice of noise model (0, 1 or 2)
        init_paras: numpy array
            initial vector of parameters to start the optimization of the model from data (df)
        output_dir : str
            default value is None, it is the name of the output directory in which to save the values of the parameters
        filename : str
            default value is None, name (without extension) of the output file; 'nullpara' + str(noise_model) is used if None
        display_loss_function : bool
            boolean variable to choose whether to print the loss function during the experimental noise learning, default value is
            False (currently unused: the loss is printed at every iteration).

        Returns
        -------
        outstruct
            scipy.optimize.OptimizeResult, the learned parameters of the noise model are in outstruct.x
        constr_value
            float, value of the constraint

        """

        # Data introduction
        sparse_rep = self.get_sparserep(df)
        constr_type = 1

        # Choice of the model:
        # Parameters initialization depending on the model
        if noise_model < 1:
            parameter_labels = ['alph_rho', 'beta', 'alpha', 'm_total', 'fmin']
        elif noise_model == 1:
            parameter_labels = ['alph_rho', 'beta', 'alpha', 'fmin']
        else:
            parameter_labels = ['alph_rho', 'fmin']

        assert len(parameter_labels) == len(init_paras), "number of model and initial paras differ!"

        condict = {'type': 'eq', 'fun': self._nullmodel_constr_fn, 'args': (sparse_rep, noise_model, constr_type)}


        partialobjfunc = partial(self._get_Pn1n2, sparse_rep=sparse_rep, noise_model=noise_model)
        nullfunctol = 1e-6
        nullmaxiter = 200
        header = ['Iter'] + parameter_labels
        print(''.join(['{' + str(it) + ':9s} ' for it in range(len(init_paras) + 1)]).format(*header))

        global curr_iter
        curr_iter = 1
        callbackp = partial(self._callback, nparas=len(init_paras), sparse_rep = sparse_rep, noise_model= noise_model)
        outstruct = minimize(partialobjfunc, init_paras, method='SLSQP', callback=callbackp, constraints=condict,
                             options={'ftol': nullfunctol, 'disp': True, 'maxiter': nullmaxiter})

        constr_value = self._nullmodel_constr_fn(outstruct.x, sparse_rep, noise_model, constr_type)

        # parameter_labels was already chosen above according to noise_model
        para_frame = pd.DataFrame(data = {'label' : parameter_labels, 'value': outstruct.x})

        # Save the learned parameters; default to the current directory and to the name 'nullpara<noise_model>'.
        # This also covers filename given without output_dir, which previously raised a TypeError.
        out_name = filename if filename is not None else 'nullpara' + str(noise_model)
        out_prefix = output_dir + '/' if output_dir is not None else ''
        para_frame.to_csv(out_prefix + out_name + '.txt', sep = '\t')

        return outstruct, constr_value

    def diversity_estimate(self, df, paras, noise_model):

        """
        Estimate diversity of the individual repertoire from the experimental noise learning step.
        Parameters
        ----------
        df : data-frame
            The data-frame which has been used to learn the noise model
        paras : numpy array
            vector containing the noise parameters
        noise_model : int
            choice of noise model
        Returns
        -------
        diversity_estimate
            int, diversity estimate from the noise model inference.

        """

        sparse_rep = self.get_sparserep(df)

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2, NreadsI, NreadsII = sparse_rep

        nfbins = 1200
        freq_dtype = float

        # Parameters

        alpha = paras[0]
        fmin = np.power(10,paras[-1])

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)

        logfvec_tmp=deepcopy(logfvec)

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)

        # for the trapezoid integral methods

        dlogfby2=np.diff(logfvec)/2

        # Compute P(0,0) for the normalization constraint
        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])

        #print(np.sum(sparse_rep_counts))
        N_obs = np.sum(sparse_rep_counts)

        # observed clones divided by the probability of being observed in at least one sample
        return int(N_obs/(1-Pn0n0))
A class whose methods learn the experimental noise from same-day biological replicate RepSeq samples. ...
Methods
get_sparserep(df) : get a sparse representation of the abundances / frequencies of the TCR clones present in both RepSeq samples of interest. This changes the data input to speed up the algorithm.
learn_null_model(df, noise_model, init_paras, output_dir = None, filename = None, display_loss_function = False) : function to optimize the likelihood associated with the experimental noise model and get the associated parameters.
diversity_estimate(df, paras, noise_model) : function to get the estimate of diversity from the noise model information.
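The diversity estimate divides the number of observed clones by the probability of being seen in at least one sample, 1 - P(0,0), where P(0,0) is obtained by integrating the power-law clone frequency density against the probability of zero reads in both samples. The sketch below illustrates that calculation for the Poisson noise model only. It is an assumption-laden illustration, not the package's API: the function names `power_law_log_density` and `prob_unseen_poisson`, the parameter values and the number of observed clones are all hypothetical.

```python
import numpy as np

def power_law_log_density(alpha_rho, fmin, nfbins=1200):
    """Log of a normalized power-law density rho(f) ~ f**alpha_rho on a log-spaced grid in [fmin, 1]."""
    logfvec = np.linspace(np.log(fmin), 0.0, nfbins)   # natural-log frequencies
    logrhovec = alpha_rho * logfvec
    integ = np.exp(logrhovec + logfvec)                # rho(f) * f, because df = f dlog(f)
    norm = np.dot(np.diff(logfvec) / 2.0, integ[1:] + integ[:-1])  # trapezoid rule in log f
    return logrhovec - np.log(norm), logfvec

def prob_unseen_poisson(alpha_rho, fmin, nreads1, nreads2, nfbins=1200):
    """P(0,0): probability that a clone gets zero reads in both samples under Poisson sampling."""
    logrhovec, logfvec = power_law_log_density(alpha_rho, fmin, nfbins)
    f = np.exp(logfvec)
    log_p0_both = -(nreads1 + nreads2) * f             # log[P(0|f) in sample 1 * P(0|f) in sample 2]
    integ = np.exp(logrhovec + log_p0_both + logfvec)
    return np.dot(np.diff(logfvec) / 2.0, integ[1:] + integ[:-1])

# Hypothetical parameter values and clone number, for illustration only.
p00 = prob_unseen_poisson(alpha_rho=-2.0, fmin=1e-9, nreads1=1e6, nreads2=1e6)
n_obs = 50_000
diversity = int(n_obs / (1.0 - p00))
```

Because most clones of a power-law repertoire sit at frequencies too low to be sampled, P(0,0) is typically close to 1 and the estimated diversity is much larger than the number of observed clones.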
def get_sparserep(self, df):
Transforms {(n1,n2)} data stored in a pandas dataframe to a sparse 1D representation. unicountvals_1(2) are the unique values of n1(2). sparse_rep_counts gives the counts of unique pairs. indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair. len(indn1)=len(indn2)=len(sparse_rep_counts)
Parameters
- df (pandas data frame): data-frame which is the output of the method .import_data() for one Data_Process instance. This data-frame should list the TCR clones present in two replicate RepSeq samples together with their clone frequencies and clone abundances in the first and second replicate.
Returns
- indn1: numpy array giving, for each unique pair, the index of its n1 value in unicountvals_1
- indn2: numpy array giving, for each unique pair, the index of its n2 value in unicountvals_2
- sparse_rep_counts: numpy array, number of clones having the read count pair (n1,n2)
- unicountvals_1: numpy array of the unique count values present in the first sample, df['Clone_count_1']
- unicountvals_2: numpy array of the unique count values present in the second sample, df['Clone_count_2']
- Nreads1: float, total number of counts/reads in the first sample, labeled "_1" in df
- Nreads2: float, total number of counts/reads in the second sample, labeled "_2" in df
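The sparse representation can be illustrated with a few made-up count pairs. This is a minimal sketch written with plain pandas and NumPy calls rather than the method itself; the toy counts are an assumption, while the variable names mirror the outputs listed above.

```python
import numpy as np
import pandas as pd

# Made-up read counts of five clones in two replicates.
df = pd.DataFrame({
    "Clone_count_1": [1, 1, 0, 3, 1],
    "Clone_count_2": [0, 0, 2, 3, 2],
})

# Count how many clones share each (n1, n2) pair.
counts = df[["Clone_count_1", "Clone_count_2"]].copy()
counts["paircount"] = 1
clone_counts = counts.groupby(["Clone_count_1", "Clone_count_2"]).sum()

sparse_rep_counts = clone_counts["paircount"].to_numpy()
n1 = clone_counts.index.get_level_values(0).to_numpy()
n2 = clone_counts.index.get_level_values(1).to_numpy()

# Unique count values per sample, plus the index of each pair's value in those arrays.
unicountvals_1, indn1 = np.unique(n1, return_inverse=True)
unicountvals_2, indn2 = np.unique(n2, return_inverse=True)

# Total reads per sample.
Nreads1 = df["Clone_count_1"].sum()
Nreads2 = df["Clone_count_2"].sum()
```

Likelihood terms then only need to be evaluated once per unique count value and once per unique pair, instead of once per clone, which is where the speed-up comes from.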
def learn_null_model(self, df, noise_model, init_paras, output_dir = None, filename = None, display_loss_function = False):
Parameters
- df (pandas data frame): data frame returned by the method .import_data() of a Data_Process instance. It should list the TCR clones present in two replicate RepSeq samples, with their clone frequencies and clone abundances in the first and second replicate.
- noise_model (int): choice of noise model
- init_paras (numpy array): initial vector of parameters from which the optimization of the model on the data (df) starts
- output_dir (str): default value is None; name of the output directory in which the parameter values are saved
- filename (str): default value is None; base name (without extension) of the output file. When it is None, the file is named 'nullpara' + str(noise_model) + '.txt'.
- display_loss_function (bool): whether to print the loss function during the experimental noise learning; default value is False.
Returns
- outstruct: result object returned by scipy.optimize.minimize; the fitted parameters of the noise model are in outstruct.x
- constr_value: float, value of the constraint
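The length of init_paras has to match the chosen noise model, otherwise the assertion in the source above fails. As a quick reference, the label sets and the naming of the output file can be sketched as below. This is only an illustration based on the listing: the helper names expected_labels and output_path are hypothetical and are not part of NoisET, and the handling of a filename given without output_dir is an assumption of this sketch.

```python
# Parameter labels per noise model, copied from the branches of learn_null_model.
LABELS_MODEL_0 = ['alph_rho', 'beta', 'alpha', 'm_total', 'fmin']  # noise_model < 1
LABELS_MODEL_1 = ['alph_rho', 'beta', 'alpha', 'fmin']             # noise_model == 1
LABELS_MODEL_2 = ['alph_rho', 'fmin']                              # any other value


def expected_labels(noise_model):
    """Hypothetical helper: labels that init_paras must line up with."""
    if noise_model < 1:
        return LABELS_MODEL_0
    if noise_model == 1:
        return LABELS_MODEL_1
    return LABELS_MODEL_2


def output_path(noise_model, output_dir=None, filename=None):
    """Hypothetical helper: name of the tab-separated parameter file.

    Assumption of this sketch: a filename given without output_dir
    is written to the current directory.
    """
    name = filename if filename is not None else 'nullpara' + str(noise_model)
    if output_dir is None:
        return name + '.txt'
    return output_dir + '/' + name + '.txt'


# A negative binomial fit (noise_model = 1) needs four initial values.
init_paras = [-2.0, 1.0, 1.1, -9.5]  # illustrative numbers only
assert len(init_paras) == len(expected_labels(1))
print(output_path(1))            # nullpara1.txt
print(output_path(1, 'params'))  # params/nullpara1.txt
```

Checking the length before calling learn_null_model gives a clearer error than the assertion raised inside the method.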
def diversity_estimate(self, df, paras, noise_model):

    """
    Estimate diversity of the individual repertoire from the experimental noise learning step.
    Parameters
    ----------
    df : data-frame
        The data-frame which has been used to learn the noise model
    paras : numpy array
        vector containing the noise parameters
    noise_model : int
        choice of noise model
    Returns
    -------
    diversity_estimate
        int, diversity estimate from the noise model inference.

    """

    sparse_rep = self.get_sparserep(df)

    indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep

    nfbins = 1200
    freq_dtype = float

    # Parameters

    alpha = paras[0]
    fmin = np.power(10, paras[-1])

    logrhofvec, logfvec, normconst = self._get_rhof(alpha, nfbins, fmin, freq_dtype)

    logfvec_tmp = deepcopy(logfvec)

    logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI, logfvec_tmp, noise_model, paras)
    logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII, logfvec_tmp, noise_model, paras)

    # for the trapezoid integral methods

    dlogfby2 = np.diff(logfvec)/2

    # Compute P(0,0) for the normalization constraint
    integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
    Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])

    # number of clones observed in at least one of the two samples
    N_obs = np.sum(sparse_rep_counts)

    # correct for the clones that were missed in both samples
    return int(N_obs/(1-Pn0n0))
Estimate diversity of the individual repertoire from the experimental noise learning step.
Parameters
- df (data-frame): The data-frame which has been used to learn the noise model
- paras (numpy array): vector containing the noise parameters
- noise_model (int): choice of noise model
Returns
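- diversity_estimate: int, diversity estimate from the noise model inference, computed as N_obs / (1 - P(0,0)), where N_obs is the number of observed clones and P(0,0) is the probability that a clone is absent from both samples.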
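The estimate rescales the number of observed clones by the probability of being seen at all. A minimal, dependency-free sketch of that calculation follows, assuming the Poisson noise model, for which P(0|f) = exp(-N f), and a power-law clone frequency distribution integrated with the trapezoid rule on a log-frequency grid. The function names and all numerical values are illustrative assumptions, not NoisET API or fitted results.

```python
import math


def p00_poisson(alpha_rho, log10_fmin, nreads_1, nreads_2, nfbins=1200):
    """P(0,0): probability that a clone gets zero reads in both samples.

    Assumes Poisson sampling noise and rho(f) proportional to f**alpha_rho
    on [fmin, 1]. The integral over f uses the trapezoid rule in log f.
    """
    log_fmin = log10_fmin * math.log(10)
    # natural-log frequency grid running from log(fmin) up to log(1) = 0
    logf = [log_fmin * (1 - i / (nfbins - 1)) for i in range(nfbins)]
    # unnormalised rho(f) * f; the extra factor f comes from df = f dlogf
    weight = [math.exp((alpha_rho + 1) * lf) for lf in logf]

    def trapz(y):
        return sum((logf[i + 1] - logf[i]) / 2 * (y[i] + y[i + 1])
                   for i in range(nfbins - 1))

    norm = trapz(weight)
    integrand = [w / norm * math.exp(-(nreads_1 + nreads_2) * math.exp(lf))
                 for w, lf in zip(weight, logf)]
    return trapz(integrand)


def diversity_from_p00(n_obs, p00):
    """Observed clones divided by the probability of being observed."""
    return int(n_obs / (1 - p00))


p00 = p00_poisson(alpha_rho=-2.1, log10_fmin=-9, nreads_1=1e6, nreads_2=1e6)
print(diversity_from_p00(n_obs=100000, p00=p00))
```

When P(0,0) is close to 1, most clones are never sampled, so the estimate is much larger than N_obs and very sensitive to the fitted fmin.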
class Expansion_Model():

    """
    A class used to build an object associated to methods in order to select significant expanding or
    contracting clones from RepSeq samples taken at two different time points.

    Methods
    -------
    get_sparserep(df) :
        get sparse representation of the abundances / frequencies of the TCR clones present in RepSeq samples of both time points.
        This changes the data input to speed up the algorithm
    expansion_table(outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):
        generate the table of clones that have been significantly detected to be responsive to an acute stimulus.
    """

    def get_sparserep(self, df):
        """
        Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
        unicountvals_1(2) are the unique values of n1(2).
        sparse_rep_counts gives the counts of unique pairs.
        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
        len(indn1)=len(indn2)=len(sparse_rep_counts)
        Parameters
        ----------
        df : pandas data frame
            data frame returned by the method .import_data() of a Data_Process instance.
            It should list the TCR clones present in two RepSeq samples, taken at two
            different time points, with their clone frequencies and clone abundances at the first and second time point.
        Returns
        -------
        indn1
            numpy array, index into unicountvals_1 of the n1 value of each unique pair
        indn2
            numpy array, index into unicountvals_2 of the n2 value of each unique pair
        sparse_rep_counts
            numpy array, # of clones having the read counts pair {(n1,n2)}
        unicountvals_1
            numpy array of the unique count values present in the first sample, df['Clone_count_1']
        unicountvals_2
            numpy array of the unique count values present in the second sample, df['Clone_count_2']
        Nreads1
            float, total number of counts/reads in the first sample referred in df by "_1" for first time point
        Nreads2
            float, total number of counts/reads in the second sample referred in df by "_2" for second time point
        """

        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']]
        counts['paircount'] = 1  # gives a weight of 1 to each observed clone

        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
        clonecountpair_vals = clone_counts.index.values
        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
        NreadsI = np.sum(counts['Clone_count_1'])
        NreadsII = np.sum(counts['Clone_count_2'])

        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)

        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII


    def _NegBinPar(self,m,v,mvec):
        '''
        Same as NegBinParMtr, but for m and v being scalars.
        Assumes m>0.
        Output is (len(mvec),) array
        '''
        mmax=mvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(mmax+1,dtype=float)
        NBvec[1:]=np.log((NBvec[1:]+r-1)/NBvec[1:]*p) #vectorization won't help unfortunately here since log needs to be over array
        NBvec[0]=r*math.log(m/v)
        NBvec=np.exp(np.cumsum(NBvec)[mvec]) #save a bit here
        return NBvec


    def _NegBinParMtr(self,m,v,nvec): #speed up only insofar as the log and exp are called once on array instead of multiple times on rows
        '''
        computes NegBin probabilities over the ordered (but possibly discontiguous) vector (nvec)
        for mean/variance combinations given by the mean (m) and variance (v) vectors.
        Note that m<v for negative binomial.
        Output is (len(m),len(nvec)) array
        '''
        nmax=nvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(nmax+1,dtype=float)
        NBvec=np.log((NBvec+r[:,np.newaxis]-1)*(p[:,np.newaxis]/NBvec))
        NBvec[:,0]=r*np.log(m/v) #handle NBvec[0]=0, treated specially when m[0]=0, see below
        NBvec=np.exp(np.cumsum(NBvec,axis=1)) #save a bit here
        if m[0]==0:
            NBvec[0,:]=0.
            NBvec[0,0]=1.
        NBvec=NBvec[:,nvec]
        return NBvec

    def _PoisPar(self, Mvec,unicountvals):
        #assert Mvec[0]==0, "first element needs to be zero"
        nmax=unicountvals[-1]
        nlen=len(unicountvals)
        mlen=len(Mvec)
        Nvec=unicountvals
        logNvec=-np.insert(np.cumsum(np.log(np.arange(1,nmax+1))),0,0.)[unicountvals] #avoid n=0 nans
        Nmtr=np.exp(Nvec[np.newaxis,:]*np.log(Mvec)[:,np.newaxis]+logNvec[np.newaxis,:]-Mvec[:,np.newaxis]) # np.log(Mvec) throws warning: since log(0)=-inf
        if Mvec[0]==0:
            Nmtr[0,:]=np.zeros((nlen,)) #when m=0, n=0, and so get rid of nans from log(0)
            Nmtr[0,0]=1. #handled below
        if unicountvals[0]==0: #if n=0 included get rid of nans from log(0)
            Nmtr[:,0]=np.exp(-Mvec)
        return Nmtr

    def _get_rhof(self,alpha_rho, nfbins,fmin,freq_dtype):
        '''
        generates power law (power is alpha_rho) clone frequency distribution over
        nfbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax), nfbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst
        return logrhovec,logfvec


    def _get_logPn_f(self,unicounts,Nreads,logfvec, noise_model, paras):

        """
        tools to compute the likelihood of the noise model. It is not useful for the user.
        """

        # Choice of the model:

        if noise_model<1:

            m_total=float(np.power(10, paras[3]))
            r_c=Nreads/m_total
        if noise_model<2:

            beta_mv= paras[1]
            alpha_mv=paras[2]

        if noise_model<1: #for models that include cell counts
            #compute parametrized range (mean-sigma,mean+5*sigma) of m values (number of cells) conditioned on n values (reads) appearing in the data only
            nsigma=5.
            nmin=300.
            #for each n, get actual range of m to compute around n-dependent mean m
            m_low =np.zeros((len(unicounts),),dtype=int)
            m_high=np.zeros((len(unicounts),),dtype=int)
            for nit,n in enumerate(unicounts):
                mean_m=n/r_c
                dev=nsigma*np.sqrt(mean_m)
                m_low[nit] =int(mean_m- dev) if (mean_m>dev**2) else 0
                m_high[nit]=int(mean_m+5*dev) if ( n>nmin) else int(10*nmin/r_c)
            m_cellmax=np.max(m_high)
            #across n, collect all in-range m
            mvec_bool=np.zeros((m_cellmax+1,),dtype=bool) #cheap bool
            nvec=range(len(unicounts))
            for nit in nvec:
                mvec_bool[m_low[nit]:m_high[nit]+1]=True #mask vector
            mvec=np.arange(m_cellmax+1)[mvec_bool]
            #transform to in-range index
            for nit in nvec:
                m_low[nit]=np.where(m_low[nit]==mvec)[0][0]
                m_high[nit]=np.where(m_high[nit]==mvec)[0][0]

        Pn_f=np.zeros((len(logfvec),len(unicounts)))
        if noise_model==0:

            mean_m=m_total*np.exp(logfvec)
            var_m=mean_m+beta_mv*np.power(mean_m,alpha_mv)
            Poisvec = self._PoisPar(mvec*r_c,unicounts)
            for f_it in range(len(logfvec)):
                NBvec=self._NegBinPar(mean_m[f_it],var_m[f_it],mvec)
                for n_it,n in enumerate(unicounts):
                    Pn_f[f_it,n_it]=np.dot(NBvec[m_low[n_it]:m_high[n_it]+1],Poisvec[m_low[n_it]:m_high[n_it]+1,n_it])

        elif noise_model==1:

            mean_n=Nreads*np.exp(logfvec)
            var_n=mean_n+beta_mv*np.power(mean_n,alpha_mv)
            Pn_f = self._NegBinParMtr(mean_n,var_n,unicounts)
        elif noise_model==2:

            mean_n=Nreads*np.exp(logfvec)
            Pn_f= self._PoisPar(mean_n,unicounts)
        else:
            print('acq_model is 0,1,or 2 only')

        return np.log(Pn_f)

    def _get_Ps(self, alp,sbar,smax,stp):
        '''
        generates symmetric exponential distribution over log fold change
        with effect size sbar and nonresponding fraction 1-alp at s=0.
        computed over discrete range of s from -smax to smax in steps of size stp
        '''
        lamb=-stp/sbar
        smaxt=round(smax/stp)
        s_zeroind=int(smaxt)
        Z=2*(np.exp((smaxt+1)*lamb)-1)/(np.exp(lamb)-1)-1
        Ps=alp*np.exp(lamb*np.fabs(np.arange(-smaxt,smaxt+1)))/Z
        Ps[s_zeroind]+=(1-alp)
        return Ps

    def _callbackFdiffexpr(self, Xi): #case dependent
        '''prints iteration info. called by scipy.optimize.minimize'''

        print('{0: 3.6f} {1: 3.6f} '.format(Xi[0], Xi[1])+'\n')


    def _learning_dynamics_expansion_polished(self, df, paras_1, paras_2, noise_model):
        """
        function to infer the expansion model parameters - not usable by the user.
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = self.get_sparserep(df)

        alpha_rho = paras_1[0]
        fmin = np.power(10,paras_1[-1])
        freq_dtype = 'float64'
        nfbins = 1200 #Accuracy of the integration


        logrhofvec, logfvec = self._get_rhof(alpha_rho, nfbins, fmin, freq_dtype)

        #Definition of svec
        smax = 25.0 #maximum absolute logfold change value
        s_step = 0.1
        s_0 = -1

        s_step_old= s_step
        logf_step= logfvec[1] - logfvec[0] #use natural log here since f2 increments in increments in exp().
        f2s_step= int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        s_step= float(f2s_step)*logf_step
        smax= s_step*(smax/s_step_old)
        svec= s_step*np.arange(0,int(round(smax/s_step)+1))
        svec= np.append(-svec[1:][::-1],svec)

        smaxind=(len(svec)-1)/2
        f2s_step=int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        logfmin=logfvec[0 ]-f2s_step*smaxind*logf_step
        logfmax=logfvec[-1]+f2s_step*smaxind*logf_step

        logfvecwide = np.linspace(logfmin,logfmax,int(len(logfvec)+2*smaxind*f2s_step)) #a wider domain for the second frequency f2=f1*exp(s)

        # Compute P(n1|f) and P(n2|f), each in an iteration of the following loop

        for it in range(2):
            if it == 0:
                unicounts=unicountvals_1
                logfvec_tmp=deepcopy(logfvec)
                Nreads = NreadsI
                paras = paras_1
            else:
                unicounts=unicountvals_2
                logfvec_tmp=deepcopy(logfvecwide) #contains s-shift for sampled data method
                Nreads = NreadsII
                paras = paras_2
            if it == 0:
                logPn1_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

            else:
                logPn2_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

        #for the trapezoid method
        dlogfby2=np.diff(logfvec)/2

        # Computing P(n1,n2|f,s)
        Pn1n2_s=np.zeros((len(svec), len(unicountvals_1), len(unicountvals_2)))

        for s_it,s in enumerate(svec):
            for it,(n1_it, n2_it) in enumerate(zip(indn1,indn2)):
                integ = np.exp(logrhofvec+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),n2_it]+logPn1_f[:,n1_it]+ logfvec )
                Pn1n2_s[s_it, n1_it, n2_it] = np.dot(dlogfby2,integ[1:] + integ[:-1])


        Pn0n0_s = np.zeros(svec.shape)
        for s_it,s in enumerate(svec):
            integ=np.exp(logPn1_f[:,0]+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),0]+logrhofvec+logfvec)
            Pn0n0_s[s_it]=np.dot(dlogfby2,integ[1:]+integ[:-1])


        N_obs = np.sum(sparse_rep_counts)
        print("N_obs: " + str(N_obs))


        def cost(PARAS):

            alp = PARAS[0]
            sbar = PARAS[1]

            Ps = self._get_Ps(alp,sbar,smax,s_step)
            Pn0n0=np.dot(Pn0n0_s,Ps)
            Pn1n2_ps=np.sum(Pn1n2_s*Ps[:,np.newaxis,np.newaxis],0)
            Pn1n2_ps/=1-Pn0n0
            print(Pn0n0)

            Energy = - np.dot(sparse_rep_counts/float(N_obs),np.where(Pn1n2_ps[indn1,indn2]>0,np.log(Pn1n2_ps[indn1,indn2]),0))

            return Energy

        #--------------------------Compute-the-grid-----------------------------------------

        print('Calculation Surface : \n')
        st = time.time()

        npoints = 20 #to be chosen by the user
        alpvec = np.logspace(-3,np.log10(0.99), npoints)
        sbarvec = np.linspace(0.01,5, npoints)

        LSurface =np.zeros((len(sbarvec),len(alpvec)))
        for i in range(len(sbarvec)):
            for j in range(len(alpvec)):
                LSurface[i, j]= - cost([alpvec[j], sbarvec[i]])

        alpmesh, sbarmesh = np.meshgrid(alpvec, sbarvec)
        a,b = np.where(LSurface == np.max(LSurface))
        print("--- %s seconds ---" % (time.time() - st))


        #------------------------------Optimization----------------------------------------------

        optA = alpmesh[a[0],b[0]]
        optB = sbarmesh[a[0],b[0]]

        print('polish parameter estimate from '+ str(optA)+' '+str(optB))
        initparas=(optA,optB)


        outstruct = minimize(cost, initparas, method='SLSQP', callback=self._callbackFdiffexpr, tol=1e-6,options={'ftol':1e-8 ,'disp': True,'maxiter':300})

        return outstruct.x, Pn1n2_s, Pn0n0_s, svec

    def _learning_dynamics_expansion(self, sparse_rep, paras_1, paras_2, noise_model, display_plot=False):
        """
        function to infer the expansion model parameters - not usable by the user.
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = sparse_rep

        alpha_rho = paras_1[0]
        fmin = np.power(10,paras_1[-1])
        freq_dtype = 'float64'
        nfbins = 1200 #Accuracy of the integration


        logrhofvec, logfvec = self._get_rhof(alpha_rho, nfbins, fmin, freq_dtype)

        #Definition of svec
        smax = 25.0 #maximum absolute logfold change value
        s_step = 0.1
        s_0 = -1

        s_step_old= s_step
        logf_step= logfvec[1] - logfvec[0] #use natural log here since f2 increments in increments in exp().
        f2s_step= int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        s_step= float(f2s_step)*logf_step
        smax= s_step*(smax/s_step_old)
        svec= s_step*np.arange(0,int(round(smax/s_step)+1))
        svec= np.append(-svec[1:][::-1],svec)

        smaxind=(len(svec)-1)/2
        f2s_step=int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        logfmin=logfvec[0 ]-f2s_step*smaxind*logf_step
        logfmax=logfvec[-1]+f2s_step*smaxind*logf_step

        logfvecwide = np.linspace(logfmin,logfmax,int(len(logfvec)+2*smaxind*f2s_step)) #a wider domain for the second frequency f2=f1*exp(s)

        # Compute P(n1|f) and P(n2|f), each in an iteration of the following loop

        for it in range(2):
            if it == 0:
                unicounts=unicountvals_1
                logfvec_tmp=deepcopy(logfvec)
                Nreads = NreadsI
                paras = paras_1
            else:
                unicounts=unicountvals_2
                logfvec_tmp=deepcopy(logfvecwide) #contains s-shift for sampled data method
                Nreads = NreadsII
                paras = paras_2
            if it == 0:
                logPn1_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

            else:
                logPn2_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

        #for the trapezoid method
        dlogfby2=np.diff(logfvec)/2

        # Computing P(n1,n2|f,s)
        Pn1n2_s=np.zeros((len(svec), len(unicountvals_1), len(unicountvals_2)))

        for s_it,s in enumerate(svec):
            for it,(n1_it, n2_it) in enumerate(zip(indn1,indn2)):
                integ = np.exp(logrhofvec+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),n2_it]+logPn1_f[:,n1_it]+ logfvec )
                Pn1n2_s[s_it, n1_it, n2_it] = np.dot(dlogfby2,integ[1:] + integ[:-1])


        Pn0n0_s = np.zeros(svec.shape)
        for s_it,s in enumerate(svec):
            integ=np.exp(logPn1_f[:,0]+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),0]+logrhofvec+logfvec)
            Pn0n0_s[s_it]=np.dot(dlogfby2,integ[1:]+integ[:-1])


        N_obs = np.sum(sparse_rep_counts)
        print("N_obs: " + str(N_obs))


        def cost(PARAS):

            alp = PARAS[0]
            sbar = PARAS[1]

            Ps = self._get_Ps(alp,sbar,smax,s_step)
            Pn0n0=np.dot(Pn0n0_s,Ps)
            Pn1n2_ps=np.sum(Pn1n2_s*Ps[:,np.newaxis,np.newaxis],0)
            Pn1n2_ps/=1-Pn0n0
            #print(Pn0n0)

            Energy = - np.dot(sparse_rep_counts/float(N_obs),np.where(Pn1n2_ps[indn1,indn2]>0,np.log(Pn1n2_ps[indn1,indn2]),0))

            return Energy

        #--------------------------Compute-the-grid-----------------------------------------

        print('Calculation Surface : \n')
        st = time.time()

        npoints = 50 #to be chosen by the user
        alpvec = np.logspace(-3,np.log10(0.99), npoints)
        sbarvec = np.linspace(0.01,5, npoints)

        LSurface =np.zeros((len(sbarvec),len(alpvec)))
        for i in range(len(sbarvec)):
            for j in range(len(alpvec)):
                LSurface[i, j]= - cost([alpvec[j], sbarvec[i]])

        alpmesh, sbarmesh = np.meshgrid(alpvec, sbarvec)
        a,b = np.where(LSurface == np.max(LSurface))
        print("--- %s seconds ---" % (time.time() - st))

        #---------------------------Plot-the-grid-------------------------------------------
        if display_plot:

            fig, ax =plt.subplots(1, figsize=(10,8))


            a,b = np.where(LSurface == np.max(LSurface))

            ax.contour(alpmesh, sbarmesh, LSurface, linewidths=1, colors='k', linestyles = 'solid')
            plt.contourf(alpmesh, sbarmesh, LSurface, 20, cmap = 'viridis', alpha= 0.8)

            xmax = alpmesh[a[0],b[0]]
            ymax = sbarmesh[a[0],b[0]]
            text= r"$ \alpha={:.3f}, s={:.3f} $".format(xmax, ymax)
            bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
            arrowprops=dict(arrowstyle="->",connectionstyle="angle,angleA=0,angleB=80")
            kw = dict(xycoords='data',textcoords="axes fraction",
                      arrowprops=arrowprops, bbox=bbox_props, ha="right", va="top")
            plt.annotate(text, xy=(xmax, ymax), xytext=(0.94,0.96), **kw)
            plt.xlabel(r'$ \alpha, \ size \ of \ the \ repertoire \ that \ answers \ to \ the \ vaccine $')
            plt.ylabel(r'$ s_{bar}, \ characteristic \ expansion \ decrease $')
            plt.xscale('log')
            plt.yscale('log')
            plt.grid()
            plt.title(r'$Grid \ Search \ graph \ for \ \alpha \ and \ s_{bar} \ parameters. $')
            plt.colorbar()

        return LSurface, Pn1n2_s, Pn0n0_s, svec


    def _save_table(self, outpath, svec, Ps,Pn1n2_s, Pn0n0_s, subset, unicountvals_1_d, unicountvals_2_d, indn1_d, indn2_d, print_expanded, pthresh, smedthresh):
        '''
        takes learned diffexpr model, Pn1n2_s*Ps, computes posteriors over (n1,n2) pairs, and writes to file a table of data with clones as rows and columns as measures of their posteriors
        print_expanded=True orders table as ascending by 1-P(s>0), else descending
        pthresh is the threshold in 'p-value'-like (null hypo) probability, 1-P(s>0|n1_i,n2_i), where i is the row (i.e. the clone) n.b. lower null prob implies larger probability of expansion
        smedthresh is the threshold on the posterior median, below which clones are discarded
        not usable by the user.
        '''

        Psn1n2_ps=Pn1n2_s*Ps[:,np.newaxis,np.newaxis]

        #compute marginal likelihood (neglect renormalization , since it cancels in conditional below)
        Pn1n2_ps=np.sum(Psn1n2_ps,0)

        Ps_n1n2ps=Pn1n2_s*Ps[:,np.newaxis,np.newaxis]/Pn1n2_ps[np.newaxis,:,:]
        #compute cdf to get p-value to threshold on to reduce output size
        cdfPs_n1n2ps=np.cumsum(Ps_n1n2ps,0)


        def dummy(row,cdfPs_n1n2ps,unicountvals_1_d,unicountvals_2_d):
            '''
            when applied to dataframe, generates 'p-value'-like (null hypo) probability, 1-P(s>0|n1_i,n2_i), where i is the row (i.e. the clone)
            '''
            return cdfPs_n1n2ps[np.argmin(np.fabs(svec)),row['Clone_count_1']==unicountvals_1_d,row['Clone_count_2']==unicountvals_2_d][0]
        dummy_part=partial(dummy,cdfPs_n1n2ps=cdfPs_n1n2ps,unicountvals_1_d=unicountvals_1_d,unicountvals_2_d=unicountvals_2_d)

        cdflabel=r'$1-P(s>0)$'
        subset[cdflabel]=subset.apply(dummy_part, axis=1)
        subset=subset[subset[cdflabel]<pthresh].reset_index(drop=True)

        #go from clone count pair (n1,n2) to index in unicountvals_1_d and unicountvals_2_d
        data_pairs_ind_1=np.zeros((len(subset),),dtype=int)
        data_pairs_ind_2=np.zeros((len(subset),),dtype=int)
        for it in range(len(subset)):
            data_pairs_ind_1[it]=np.where(int(subset.iloc[it].Clone_count_1)==unicountvals_1_d)[0][0]
            data_pairs_ind_2[it]=np.where(int(subset.iloc[it].Clone_count_2)==unicountvals_2_d)[0][0]
        #posteriors over data clones
        Ps_n1n2ps_datpairs=Ps_n1n2ps[:,data_pairs_ind_1,data_pairs_ind_2]

        #compute posterior metrics
        mean_est=np.zeros((len(subset),))
        max_est= np.zeros((len(subset),))
        slowvec= np.zeros((len(subset),))
        smedvec= np.zeros((len(subset),))
        shighvec=np.zeros((len(subset),))
        pval=0.025 #double-sided comparison statistical test
        pvalvec=[pval,0.5,1-pval] #bound criteria defining slow, smed, and shigh, respectively
        for it,column in enumerate(np.transpose(Ps_n1n2ps_datpairs)):
            mean_est[it]=np.sum(svec*column)
            max_est[it]=svec[np.argmax(column)]
            forwardcmf=np.cumsum(column)
            backwardcmf=np.cumsum(column[::-1])[::-1]
            inds=np.where((forwardcmf[:-1]<pvalvec[0]) & (forwardcmf[1:]>=pvalvec[0]))[0]
            slowvec[it]=np.mean(svec[inds+np.ones((len(inds),),dtype=int)]) #use mean in case there are two values
            inds=np.where((forwardcmf>=pvalvec[1]) & (backwardcmf>=pvalvec[1]))[0]
            smedvec[it]=np.mean(svec[inds])
            inds=np.where((forwardcmf[:-1]<pvalvec[2]) & (forwardcmf[1:]>=pvalvec[2]))[0]
            shighvec[it]=np.mean(svec[inds+np.ones((len(inds),),dtype=int)])

        colnames=(r'$\bar{s}$',r'$s_{max}$',r'$s_{3,high}$',r'$s_{2,med}$',r'$s_{1,low}$')
        for it,coldata in enumerate((mean_est,max_est,shighvec,smedvec,slowvec)):
            subset.insert(0,colnames[it],coldata)
        oldcolnames=( 'AACDR3', 'ntCDR3', 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1', 'Clone_fraction_2')
        newcolnames=('CDR3_AA', 'CDR3_nt', r'$n_1$', r'$n_2$', r'$f_1$', r'$f_2$')
        subset=subset.rename(columns=dict(zip(oldcolnames, newcolnames)))

        #select only clones whose posterior median pass the given threshold
        subset=subset[subset[r'$s_{2,med}$']>smedthresh]

        print("writing to: "+outpath)
        if print_expanded:
            subset=subset.sort_values(by=cdflabel,ascending=True)
            strout='expanded'
        else:
            subset=subset.sort_values(by=cdflabel,ascending=False)
            strout='contracted'
        subset.to_csv(outpath+'top_'+strout+'.csv',sep='\t',index=False)


    def expansion_table(self, outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):

        '''
        generate the table of clones that have been significantly detected to be responsive to an acute stimulus.

        Parameters
        ----------
        outpath : str
            Name of the directory where to store the output table
        paras_1 : numpy array
            parameters of the noise model that has been learned at time_1
        paras_2 : numpy array
            parameters of the noise model that has been learned at time_2
        df : pandas dataframe
            pandas dataframe merging the two RepSeq data at time_1 and time_2
        noise_model : int
            choice of noise model 0: negative Binomial + Poisson, 1: negative Binomial, 2: Poisson
        pval_threshold : float
            P-value threshold to detect and discriminate if a TCR clone has expanded
        smed_threshold : float
            median of the log-fold change threshold to detect if a TCR clone has expanded
        Returns
        -------
        data-frame - csv file
            the output is a tab-separated file, outpath + 'top_expanded.csv', of columns : $s_{1,low}$, $s_{2,med}$, $s_{3,high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and the p-value-like column '$1-P(s>0)$'
        '''

        sparse_rep = self.get_sparserep(df)
        L_surface, Pn1n2_s_d, Pn0n0_s_d, svec = self._learning_dynamics_expansion(sparse_rep, paras_1, paras_2, noise_model)
        npoints= 50 # same as in learning_dynamics_expansion
        smax = 25.0
        s_step = 0.1
        alpvec = np.logspace(-3,np.log10(0.99), npoints)
        sbarvec = np.linspace(0.01,5, npoints)
        maxinds=np.unravel_index(np.argmax(L_surface),np.shape(L_surface))
        optsbar=sbarvec[maxinds[0]]
        optalp=alpvec[maxinds[1]]
        optPs= self._get_Ps(optalp,optsbar,smax,s_step)
        pval_expanded = True

        indn1,indn2,sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep

        self._save_table(outpath, svec, optPs, Pn1n2_s_d, Pn0n0_s_d, df, unicountvals_1, unicountvals_2, indn1, indn2, pval_expanded, pval_threshold, smed_threshold)
A class used to build an object whose methods select significantly expanding or contracting clones from RepSeq samples taken at two different time points.
Methods
- get_sparserep(df): get the sparse representation of the abundances / frequencies of the TCR clones present in the RepSeq samples of both time points. This changes the data input to speed up the algorithm.
- expansion_table(outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold): generate the table of clones that have been detected as significantly responsive to an acute stimulus.
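Internally, the class combines a prior over the log fold change s (a symmetric exponential with a point mass 1 - alp at s = 0, as in _get_Ps) with the likelihood of each count pair, and reports the posterior mass at s <= 0 as a p-value-like quantity, 1 - P(s > 0 | n1, n2). The following dependency-free sketch illustrates both steps. The function names are hypothetical, and the likelihood values are invented for illustration rather than produced by a noise model.

```python
import math


def symmetric_exponential_prior(alp, sbar, smax, stp):
    """Discrete prior over s on the grid -smax, ..., 0, ..., smax (step stp).

    A fraction alp of clones responds with |s| exponentially distributed
    (scale sbar); the remaining 1 - alp sits exactly at s = 0.
    """
    lamb = -stp / sbar
    k_max = int(round(smax / stp))
    # normalisation of exp(lamb * |k|) summed over k = -k_max .. k_max
    z = 2 * (math.exp((k_max + 1) * lamb) - 1) / (math.exp(lamb) - 1) - 1
    prior = [alp * math.exp(lamb * abs(k)) / z for k in range(-k_max, k_max + 1)]
    prior[k_max] += 1 - alp  # index k_max corresponds to s = 0
    return prior


def null_probability(prior, likelihood):
    """Posterior mass at s <= 0, i.e. 1 - P(s > 0 | n1, n2)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    zero_index = len(prior) // 2
    return sum(joint[:zero_index + 1]) / sum(joint)


prior = symmetric_exponential_prior(alp=0.1, sbar=1.0, smax=25.0, stp=0.1)
svec = [0.1 * k for k in range(-250, 251)]
# invented likelihood P(n1, n2 | s) that favours a fold change around s = 2
likelihood = [math.exp(-(s - 2.0) ** 2 / 2) for s in svec]
print(null_probability(prior, likelihood))
```

A clone is kept in the output table when this null probability is below pval_threshold and its posterior median exceeds smed_threshold.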
def get_sparserep(self, df):
    """
    Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
    unicountvals_1(2) are the unique values of n1(2).
    sparse_rep_counts gives the counts of unique pairs.
    indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
    len(indn1)=len(indn2)=len(sparse_rep_counts)
    Parameters
    ----------
    df : pandas data frame
        data frame returned by the method .import_data() of a Data_Process instance.
        It should list the TCR clones present in two RepSeq samples, taken at two
        different time points, with their clone frequencies and clone abundances at the first and second time point.
    Returns
    -------
    indn1
        numpy array, index into unicountvals_1 of the n1 value of each unique pair
    indn2
        numpy array, index into unicountvals_2 of the n2 value of each unique pair
    sparse_rep_counts
        numpy array, # of clones having the read counts pair {(n1,n2)}
    unicountvals_1
        numpy array of the unique count values present in the first sample, df['Clone_count_1']
    unicountvals_2
        numpy array of the unique count values present in the second sample, df['Clone_count_2']
    Nreads1
        float, total number of counts/reads in the first sample referred in df by "_1" for first time point
    Nreads2
        float, total number of counts/reads in the second sample referred in df by "_2" for second time point
    """

    counts = df.loc[:,['Clone_count_1', 'Clone_count_2']]
    counts['paircount'] = 1  # gives a weight of 1 to each observed clone

    clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
    sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
    clonecountpair_vals = clone_counts.index.values
    indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
    indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
    NreadsI = np.sum(counts['Clone_count_1'])
    NreadsII = np.sum(counts['Clone_count_2'])

    unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
    unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)

    return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII
Transforms {(n1,n2)} data stored in a pandas dataframe to a sparse 1D representation. unicountvals_1(2) are the unique values of n1(2). sparse_rep_counts gives the counts of unique pairs. indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair. len(indn1)=len(indn2)=len(sparse_rep_counts)
Parameters
- df (pandas data frame): data-frame which is the output of the method .import_data() for one Data_Process instance. This data-frame should list the TCR clones present in two RepSeq samples, taken at two different time points, together with their clone frequencies and clone abundances in the first and second sample.
Returns
- indn1: numpy array; for each unique pair, the index into unicountvals_1 giving its n1 value
- indn2: numpy array; for each unique pair, the index into unicountvals_2 giving its n2 value
- sparse_rep_counts: numpy array, # of clones having the read counts pair {(n1,n2)}
- unicountvals_1: numpy array of unique count values present in the first sample, df['Clone_count_1']
- unicountvals_2: numpy array of unique count values present in the second sample, df['Clone_count_2']
- NreadsI: total number of counts/reads in the first sample, referred to in df by "_1" for the first time point
- NreadsII: total number of counts/reads in the second sample, referred to in df by "_2" for the second time point
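The sparse representation described above can be illustrated on a toy table. This is a minimal sketch, not the package's implementation: the helper name `sparse_rep` and the toy counts are assumptions made for illustration, and `np.unique(..., axis=0)` is used here in place of the pandas `groupby` found in the source.

```python
import numpy as np
import pandas as pd


def sparse_rep(df):
    """Collapse per-clone (n1, n2) count pairs into unique pairs with multiplicities.

    Hypothetical helper for illustration only.
    """
    pairs = df[["Clone_count_1", "Clone_count_2"]].to_numpy()
    # unique (n1, n2) rows, sorted lexicographically, and how many clones share each
    uniq_pairs, sparse_rep_counts = np.unique(pairs, axis=0, return_counts=True)
    # replace raw count values by indices into the arrays of unique values
    unicountvals_1, indn1 = np.unique(uniq_pairs[:, 0], return_inverse=True)
    unicountvals_2, indn2 = np.unique(uniq_pairs[:, 1], return_inverse=True)
    return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2


# five clones, two of which share the pair (1, 0)
toy = pd.DataFrame({"Clone_count_1": [0, 1, 1, 3, 1],
                    "Clone_count_2": [2, 0, 0, 3, 5]})
indn1, indn2, counts, u1, u2 = sparse_rep(toy)

# the original pairs are recovered by indexing the unique values
n1_per_pair = u1[indn1]
n2_per_pair = u2[indn2]
```

Indexing `unicountvals_1` with `indn1` (and likewise for the second sample) gives back the count values of each unique pair, which is why the three arrays `indn1`, `indn2` and `sparse_rep_counts` always have the same length.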
def expansion_table(self, outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):

    # raw docstring so that the LaTeX "\bar" is not read as a backspace escape
    r'''
    Generate the table of clones that have been significantly detected to be responsive to an acute stimulus.

    Parameters
    ----------
    outpath : str
        Name of the directory where to store the output table
    paras_1 : numpy array
        parameters of the noise model that has been learned at time_1
    paras_2 : numpy array
        parameters of the noise model that has been learned at time_2
    df : pandas dataframe
        pandas dataframe merging the two RepSeq data at time_1 and time_2
    noise_model : int
        choice of noise model 0: negative Binomial + Poisson, 1: negative Binomial, 2: Poisson
    pval_threshold : float
        P-value threshold to detect and discriminate if a TCR clone has expanded
    smed_threshold : float
        threshold on the median log-fold change to detect if a TCR clone has expanded

    Returns
    -------
    data-frame - csv file
        the output is a csv file of columns : $s_{1-low}$, $s_{2-med}$, $s_{3-high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and '$p$-value'
    '''

    sparse_rep = self.get_sparserep(df)
    L_surface, Pn1n2_s_d, Pn0n0_s_d, svec = self._learning_dynamics_expansion(sparse_rep, paras_1, paras_2, noise_model)
    npoints = 50  # same as in learning_dynamics_expansion
    smax = 25.0
    s_step = 0.1
    alpvec = np.logspace(-3, np.log10(0.99), npoints)
    sbarvec = np.linspace(0.01, 5, npoints)
    # grid point maximizing the likelihood surface (rows: sbar, columns: alpha)
    maxinds = np.unravel_index(np.argmax(L_surface), np.shape(L_surface))
    optsbar = sbarvec[maxinds[0]]
    optalp = alpvec[maxinds[1]]
    optPs = self._get_Ps(optalp, optsbar, smax, s_step)
    pval_expanded = True

    indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep

    self._save_table(outpath, svec, optPs, Pn1n2_s_d, Pn0n0_s_d, df, unicountvals_1, unicountvals_2, indn1, indn2, pval_expanded, pval_threshold, smed_threshold)
Generate the table of clones that have been significantly detected to be responsive to an acute stimulus.
Parameters
- outpath (str): Name of the directory where to store the output table
- paras_1 (numpy array): parameters of the noise model that has been learned at time_1
- paras_2 (numpy array): parameters of the noise model that has been learned at time_2
- df (pandas dataframe): pandas dataframe merging the two RepSeq data at time_1 and time_2
- noise_model (int): choice of noise model 0: negative Binomial + Poisson, 1: negative Binomial, 2: Poisson
- pval_threshold (float): P-value threshold to detect and discriminate if a TCR clone has expanded
- smed_threshold (float): threshold on the median log-fold change to detect if a TCR clone has expanded
Returns
- data-frame - csv file: the output is a csv file of columns : $s_{1-low}$, $s_{2-med}$, $s_{3-high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and '$p$-value'
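In the source of expansion_table, the optimal pair of expansion parameters is read off a 50 x 50 grid by locating the maximum of the likelihood surface. The sketch below reproduces only that lookup. The likelihood surface here is a made-up quadratic bowl (an assumption for illustration, not the output of the package), and the row/column convention (rows index `sbarvec`, columns index `alpvec`) follows the indexing used in the source.

```python
import numpy as np

npoints = 50
alpvec = np.logspace(-3, np.log10(0.99), npoints)   # grid over alpha
sbarvec = np.linspace(0.01, 5, npoints)             # grid over sbar

# Hypothetical likelihood surface peaked at a known grid point (illustration only)
true_i, true_j = 12, 30
S, A = np.meshgrid(sbarvec, alpvec, indexing="ij")  # S varies along rows, A along columns
L_surface = (-(S - sbarvec[true_i]) ** 2
             - (np.log10(A) - np.log10(alpvec[true_j])) ** 2)

# argmax works on the flattened array; unravel_index maps it back to (row, column)
maxinds = np.unravel_index(np.argmax(L_surface), L_surface.shape)
optsbar = sbarvec[maxinds[0]]
optalp = alpvec[maxinds[1]]
```

Because the optimum is restricted to grid points, its resolution is limited by `npoints` and by the ranges of the two grids.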
class Generator:

    """
    A class used to build an object to generate in-silico (synthetic) RepSeq samples, in the case of replicates at
    the same day and in the case of having 2 samples generated at an initial time for the first one and some time after (months, years)
    for the second one using the geometric Brownian motion model described in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    ...
    Methods
    -------
    gen_synthetic_data_Null(paras, noise_model, NreadsI,NreadsII,Nsamp):
        generate in-silico same day RepSeq replicates.
    generate_trajectories(tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):
        generate in-silico t_ime apart RepSeq samples.
    """

    def _get_rhof(self, alpha_rho, fmin, freq_nbins=800, freq_dtype='float64'):

        '''
        generates power law (power is alpha_rho) clone frequency distribution over
        freq_nbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax),freq_nbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst
        return logrhovec,logfvec

    def _get_distsample(self, pmf,Nsamp,dtype='uint32'):
        '''
        generates Nsamp index samples of dtype (e.g. uint16 handles up to 65535 indices) from discrete probability mass function pmf.
        Handles multi-dimensional domain. N.B. Output is sorted.
        '''
        #assert np.sum(pmf)==1, "cmf not normalized!"

        shape = np.shape(pmf)
        sortindex = np.argsort(pmf, axis=None)#uses flattened array
        pmf = pmf.flatten()
        pmf = pmf[sortindex]
        cmf = np.cumsum(pmf)
        choice = np.random.uniform(high = cmf[-1], size = int(float(Nsamp)))
        index = np.searchsorted(cmf, choice)
        index = sortindex[index]
        index = np.unravel_index(index, shape)
        index = np.transpose(np.vstack(index))
        sampled_inds = np.array(index[np.argsort(index[:,0])],dtype=dtype)
        return sampled_inds


    def gen_synthetic_data_Null(self, paras, noise_model, NreadsI,NreadsII,Nsamp):
        '''
        outputs an array of observed clone frequencies and corresponding dataframe of pair counts
        for a null model learned from a dataset pair with NreadsI and NreadsII number of reads, respectively.
        Crucial for RAM efficiency, sampling is conditioned on being observed in each of the three (n,0), (0,n'), and n,n'>0 conditions
        so that only Nsamp clones need to be sampled, rather than the N clones in the repertoire.
        Note that no explicit normalization is applied. It is assumed that the values in paras are consistent with N<f>=1
        (e.g. were obtained through the learning done in this package).
        '''

        alpha = paras[0] #power law exponent
        fmin=np.power(10,paras[-1])
        if noise_model<1:
            m_total=float(np.power(10, paras[3]))
            r_c1=NreadsI/m_total
            r_c2=NreadsII/m_total
            r_cvec=[r_c1,r_c2]
        if noise_model<2:
            beta_mv= paras[1]
            alpha_mv=paras[2]

        logrhofvec,logfvec = self._get_rhof(alpha,fmin) #the method is defined as _get_rhof in this class
        fvec=np.exp(logfvec)
        dlogf=np.diff(logfvec)/2.

        #generate measurement model distribution, Pn_f
        Pn_f=np.empty((len(logfvec),),dtype=object) #len(logfvec) samplers

        #get value at n=0 to use for conditioning on n>0 (and get full Pn_f here if noise_model=1,2)
        m_max=1e3 #conditioned on n=0, so no edge effects

        Nreadsvec=(NreadsI,NreadsII)
        for it in range(2):
            Pn_f=np.empty((len(fvec),),dtype=object)
            if noise_model==2:
                m1vec=Nreadsvec[it]*fvec
                for find,m1 in enumerate(m1vec):
                    Pn_f[find]=poisson(m1)
                logPn0_f=-m1vec
            elif noise_model==1:
                m1=Nreadsvec[it]*fvec
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                for find,(n,p) in enumerate(zip(n,p)):
                    Pn_f[find]=nbinom(n,1-p)
                Pn0_f=np.asarray([Pn_find.pmf(0) for Pn_find in Pn_f])
                logPn0_f=np.log(Pn0_f)

            elif noise_model==0:
                m1=m_total*fvec
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pn0_f=np.zeros((len(fvec),))
                for find in range(len(Pn0_f)):
                    nbtmp=nbinom(n[find],1-p[find]).pmf(np.arange(m_max+1))
                    ptmp=poisson(r_cvec[it]*np.arange(m_max+1)).pmf(0)
                    Pn0_f[find]=np.sum(np.exp(np.log(nbtmp)+np.log(ptmp)))
                logPn0_f=np.log(Pn0_f)
            else:
                print('acq_model is 0,1,or 2 only')

            if it==0:
                Pn1_f=Pn_f
                logPn10_f=logPn0_f
            else:
                Pn2_f=Pn_f
                logPn20_f=logPn0_f

        #3-quadrant q|f conditional distribution (qx0:n1>0,n2=0;q0x:n1=0,n2>0;qxx:n1,n2>0)
        logPqx0_f=np.log(1-np.exp(logPn10_f))+logPn20_f
        logPq0x_f=logPn10_f+np.log(1-np.exp(logPn20_f))
        logPqxx_f=np.log(1-np.exp(logPn10_f))+np.log(1-np.exp(logPn20_f))
        #3-quadrant q,f joint distribution
        logPfqx0=logPqx0_f+logrhofvec
        logPfq0x=logPq0x_f+logrhofvec
        logPfqxx=logPqxx_f+logrhofvec
        #3-quadrant q marginal distribution
        Pqx0=np.trapz(np.exp(logPfqx0+logfvec),x=logfvec)
        Pq0x=np.trapz(np.exp(logPfq0x+logfvec),x=logfvec)
        Pqxx=np.trapz(np.exp(logPfqxx+logfvec),x=logfvec)

        #3 quadrant conditional f|q distribution
        Pf_qx0=np.where(Pqx0>0,np.exp(logPfqx0-np.log(Pqx0)),0)
        Pf_q0x=np.where(Pq0x>0,np.exp(logPfq0x-np.log(Pq0x)),0)
        Pf_qxx=np.where(Pqxx>0,np.exp(logPfqxx-np.log(Pqxx)),0)

        #3-quadrant q marginal distribution
        newPqZ=Pqx0 + Pq0x + Pqxx
        Pqx0/=newPqZ
        Pq0x/=newPqZ
        Pqxx/=newPqZ

        Pfqx0=np.exp(logPfqx0)
        Pfq0x=np.exp(logPfq0x)
        Pfqxx=np.exp(logPfqxx)

        print('Model probs: '+str(Pqx0)+' '+str(Pq0x)+' '+str(Pqxx))

        #get samples
        num_samples=Nsamp
        q_samples=np.random.choice(range(3), num_samples, p=(Pqx0,Pq0x,Pqxx))
        vals,counts=np.unique(q_samples,return_counts=True)
        num_qx0=counts[0]
        num_q0x=counts[1]
        num_qxx=counts[2]
        print('q samples: '+str(sum(counts))+' '+str(num_qx0)+' '+str(num_q0x)+' '+str(num_qxx))
        print('q sampled probs: '+str(num_qx0/float(sum(counts)))+' '+str(num_q0x/float(sum(counts)))+' '+str(num_qxx/float(sum(counts))))

        #x0
        integ=np.exp(np.log(Pf_qx0)+logfvec)
        f_samples_inds= self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qx0).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        qx0_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        qx0_samples=np.zeros((num_qx0,))
        if noise_model<1:
            qx0_m_samples=np.zeros((num_qx0,))
            #conditioning on n>0 applies an m-dependent factor to Pm_f, which can't be incorporated into the ppf method used for noise_model 1 and 2.
            #We handle that here by using a custom finite range sampler, which has the drawback of having to define an upper limit.
            #This works so long as n_max/r_c<<m_max, so depends on highest counts in data (n_max). My data had max counts of 1e3-1e4.
            #Alternatively, could define a custom scipy RV class by defining its PMF, but has to be array-compatible which requires care.
            m_samp_max=int(1e5)
            mvec=np.arange(m_samp_max)

        for it,find in enumerate(find_vals):
            if noise_model==0:
                m1=m_total*fvec[find]
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pm1_f=nbinom(n,1-p)

                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
                qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn1_m1=poisson(r_c1*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
                    qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)

            elif noise_model>0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
                qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
            else:
                print('acq_model is 0,1, or 2 only')
        qx0_pair_samples=np.hstack((qx0_samples[:,np.newaxis],np.zeros((num_qx0,1))))

        #0x
        integ=np.exp(np.log(Pf_q0x)+logfvec)
        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_q0x).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        q0x_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        q0x_samples=np.zeros((num_q0x,))
        if noise_model<1:
            q0x_m_samples=np.zeros((num_q0x,))
        for it,find in enumerate(find_vals):
            if noise_model==0:
                m2=m_total*fvec[find]
                v2=m2+beta_mv*np.power(m2,alpha_mv)
                p=1-m2/v2
                n=m2*m2/v2/p
                Pm2_f=nbinom(n,1-p)

                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
                q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn2_m2=poisson(r_c2*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
                    q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)

            elif noise_model > 0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
                q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
            else:
                print('acq_model is 0,1,or 2 only')
        q0x_pair_samples=np.hstack((np.zeros((num_q0x,1)),q0x_samples[:,np.newaxis]))

        #qxx
        integ=np.exp(np.log(Pf_qxx)+logfvec)
        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qxx).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        qxx_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        qxx_n1_samples=np.zeros((num_qxx,))
        qxx_n2_samples=np.zeros((num_qxx,))
        if noise_model<1:
            qxx_m1_samples=np.zeros((num_qxx,))
            qxx_m2_samples=np.zeros((num_qxx,))
        for it,find in enumerate(find_vals):
            if noise_model==0:
                m1=m_total*fvec[find]
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pm1_f=nbinom(n,1-p)

                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                if np.sum(Pm1_f_adj)==0:
                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
                else:
                    Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn1_m1=poisson(r_c1*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
                    qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)

                m2=m_total*fvec[find]
                v2=m2+beta_mv*np.power(m2,alpha_mv)
                p=1-m2/v2
                n=m2*m2/v2/p
                Pm2_f=nbinom(n,1-p)

                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                if np.sum(Pm2_f_adj)==0: #checks the second sample's distribution (the original tested Pm1_f_adj here)
                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
                else:
                    Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn2_m2=poisson(r_c2*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
                    qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)

            elif noise_model>0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
                qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
                qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
            else:
                print('acq_model is 0,1, or 2 only')

        qxx_pair_samples=np.hstack((qxx_n1_samples[:,np.newaxis],qxx_n2_samples[:,np.newaxis]))

        pair_samples=np.vstack((q0x_pair_samples,qx0_pair_samples,qxx_pair_samples))
        f_samples=np.concatenate((q0x_f_samples,qx0_f_samples,qxx_f_samples))
        output_m_samples=False #the branch below is therefore never executed
        if noise_model<1 and output_m_samples:
            m1_samples=np.concatenate((q0x_m1_samples,qx0_m1_samples,qxx_m1_samples))
            m2_samples=np.concatenate((q0x_m2_samples,qx0_m2_samples,qxx_m2_samples))

        pair_samples_df=pd.DataFrame({'Clone_count_1':pair_samples[:,0],'Clone_count_2':pair_samples[:,1]})

        pair_samples_df['Clone_fraction_1'] = pair_samples_df['Clone_count_1']/np.sum(pair_samples_df['Clone_count_1'])
        pair_samples_df['Clone_fraction_2'] = pair_samples_df['Clone_count_2']/np.sum(pair_samples_df['Clone_count_2'])

        return f_samples,pair_samples_df


    def generate_trajectories(self, tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):

        """
        generate in-silico t_ime apart RepSeq samples.

        Parameters
        ----------
        paras_1 : numpy array
            parameters of the noise model that has been learnt at time_1
        paras_2 : numpy array
            parameters of the noise model that has been learnt at time_2
        method : str
            'negative_binomial' or 'poisson'
        tau : float
            first time-scale parameter of the dynamics
        theta : float
            second time-scale parameter of the dynamics
        t_ime : float
            number of years between both synthetic sampling (between time_1 and time_2)
        filename : str
            name of the file in which the dataframe is stored
        Returns
        -------
        data-frame - csv file
            the output is a csv file of columns : 'Clone_count_1' (at time_1) 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'
        """

        np.seterr(divide = 'ignore')
        import warnings #np.warnings was removed in NumPy 1.24; use the standard-library module instead
        warnings.filterwarnings('ignore')

        method = 'negative_binomial' #NOTE: this overrides the method argument, so the 'poisson' branch below is never reached

        # Synthetic data generation

        print('execution starting...')

        st = time.time()

        #Values of the parameters
        A = -1/tau
        B = 1/theta
        N_0 = 40
        NreadsI = float(NreadsI)
        NreadsII = float(NreadsII)

        t = float(t_ime)

        if NreadsI == NreadsII:
            key_sym = '_sym_'

        else:
            key_sym = '_asym_'

        # Name of the directory

        dirName = 'output'
        os.makedirs(dirName, exist_ok=True)

        paras = paras_1 #Just put a and b of the negative binomial distribution [0.7, 1.1]
        alpha = -1 +2*A/B
        #print('alpha : ' + str(alpha))

        #1/ Generate log-population at initial time from steady-state distribution + GBM diffusion trajectories for 2 years
        x_i_LB, x_f_LB, Prop_Matrix_LB, p_ext_LB, results_extinction_LB, time_vec_LB, results_extinction_source_LB, x_source_LB = _generator_diffusion_LB(A, B, N_0, t)

        #x_i_LB, x_f_LB, Prop_Matrix, p_ext, results_extinction = generator_diffusion_LB(B, A, N_0, t)
        N_cells_day_0_LB, N_cells_day_1_LB = np.sum(np.exp(x_i_LB)), np.sum(np.exp(x_f_LB)) + np.sum(np.exp(x_source_LB)) #N_cells_final_LB
        print('NUMBER OF CELLS AT INITIAL TIME')
        print(N_cells_day_0_LB)

        print('NUMBER OF CELLS AT FINAL TIME')
        print(N_cells_day_1_LB)

        #print('SHAPE_X_I ' + str(np.shape(x_i_LB)))
        #print('SHAPE_X_F ' + str(np.shape(x_f_LB)))

        if method == 'negative_binomial':

            df_diffusion_LB = _experimental_sampling_diffusion_NegBin(NreadsI, NreadsII, paras, x_i_LB, x_f_LB, N_cells_day_0_LB, N_cells_day_1_LB)
            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')

        elif method == 'poisson':

            df_diffusion_LB = _experimental_sampling_diffusion_Poisson(NreadsI, NreadsII, x_i_LB, x_f_LB, t, N_cells_day_0_LB, N_cells_day_1_LB)
            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')
A class used to build an object that generates in-silico (synthetic) RepSeq samples, either as replicates taken on the same day or as 2 samples, the first generated at an initial time and the second some time after (months, years), using the geometric Brownian motion model described in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. ...
Methods
- gen_synthetic_data_Null(paras, noise_model, NreadsI, NreadsII, Nsamp): generate in-silico same day RepSeq replicates.
- generate_trajectories(tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'): generate in-silico t_ime apart RepSeq samples.
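When generating same-day replicates, the source of gen_synthetic_data_Null draws read counts conditioned on the clone being observed (n > 0) by rescaling uniform random numbers into the interval [P(n=0), 1) before applying the inverse CDF. The sketch below isolates that step for a single Poisson distribution. The helper name `sample_zero_truncated`, the rate 0.3 and the seed are assumptions chosen for illustration; only `scipy.stats.poisson` and its `cdf`/`ppf` methods are taken from the source.

```python
import numpy as np
from scipy.stats import poisson


def sample_zero_truncated(dist, size, rng):
    """Draw from a frozen scipy discrete distribution conditioned on n > 0.

    Hypothetical helper for illustration only.
    """
    p0 = dist.cdf(0)                         # P(n = 0)
    u = rng.random(size) * (1.0 - p0) + p0   # uniform on [p0, 1)
    # inverse CDF: any u strictly above p0 maps to a count of at least 1
    return dist.ppf(u).astype(int)


rng = np.random.default_rng(0)
# with a mean of 0.3 most unconditioned draws would be 0; none are after truncation
draws = sample_zero_truncated(poisson(0.3), size=1000, rng=rng)
```

This avoids rejection sampling, which would be wasteful for rare clones whose unconditioned counts are almost always zero. It applies directly when the count distribution has a usable `ppf`; the comments in the source explain that the negative Binomial + Poisson model needs a custom finite-range sampler instead.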
Outputs an array of observed clone frequencies and the corresponding dataframe of pair counts
for a null model learned from a dataset pair with NreadsI and NreadsII reads, respectively.
Crucially for RAM efficiency, sampling is conditioned on a clone being observed in each of the three conditions (n,0), (0,n'), and n,n'>0,
so that only Nsamp clones need to be sampled, rather than the N clones in the repertoire.
Note that no explicit normalization is applied. It is assumed that the values in paras are consistent with N.
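The two building blocks used repeatedly in the sampling code can be sketched in isolation: the negative binomial parameterized by its mean with variance mean + beta_mv * mean**alpha_mv, and inverse-CDF sampling restricted to values above zero. This is a minimal sketch, not the package API: the helper names negbin_from_mean and sample_zero_truncated are hypothetical, and the parameter values are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import nbinom, poisson


def negbin_from_mean(mean, alpha_mv, beta_mv):
    """Frozen negative binomial with variance mean + beta_mv * mean**alpha_mv.

    Hypothetical helper; requires mean > 0 and beta_mv > 0 so that var > mean.
    """
    var = mean + beta_mv * np.power(mean, alpha_mv)
    p = 1.0 - mean / var           # same intermediate quantities as the listing
    n = mean * mean / var / p
    return nbinom(n, 1.0 - p)      # SciPy takes the success probability


def sample_zero_truncated(dist, size, rng):
    """Draw from a frozen discrete distribution conditioned on a value > 0.

    Uniform draws are rescaled into [cdf(0), 1) before applying the
    percent-point function, as done in the listing.
    """
    p0 = dist.cdf(0)
    u = rng.random(size) * (1.0 - p0) + p0
    return dist.ppf(u)


rng = np.random.default_rng(0)                      # illustrative seed
counts = sample_zero_truncated(poisson(0.5), 1000, rng)
cell_dist = negbin_from_mean(10.0, alpha_mv=1.1, beta_mv=0.7)
```

Conditioning at the sampling step means no draw is wasted on unobserved clones, which is what keeps the number of sampled clones at Nsamp rather than N.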
    def generate_trajectories(self, tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):

        """
        Generate in-silico RepSeq samples taken t_ime years apart.

        Parameters
        ----------
        paras_1 : numpy array
            parameters of the noise model learnt at time_1
        paras_2 : numpy array
            parameters of the noise model learnt at time_2
        method : str
            'negative_binomial' or 'poisson'
        tau : float
            first time-scale parameter of the dynamics
        theta : float
            second time-scale parameter of the dynamics
        t_ime : float
            number of years between the two synthetic samplings (time_1 and time_2)
        filename : str
            name of the file in which the dataframe is stored
        NreadsI : str or float
            number of reads at time_1 (converted with float())
        NreadsII : str or float
            number of reads at time_2 (converted with float())

        Returns
        -------
        data-frame - csv file
            the output is a csv file with columns 'Clone_count_1' (at time_1), 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'
        """

        import warnings  # np.warnings is not available in recent NumPy releases

        np.seterr(divide = 'ignore')
        warnings.filterwarnings('ignore')

        # NOTE: this assignment overrides the method argument, so only the
        # 'negative_binomial' branch below is ever executed
        method = 'negative_binomial'

        # Synthetic data generation

        print('execution starting...')

        st = time.time()

        # Values of the parameters
        A = -1/tau
        B = 1/theta
        N_0 = 40
        NreadsI = float(NreadsI)
        NreadsII = float(NreadsII)

        t = float(t_ime)

        if NreadsI == NreadsII:
            key_sym = '_sym_'

        else:
            key_sym = '_asym_'

        # Name of the directory

        dirName = 'output'
        os.makedirs(dirName, exist_ok=True)

        paras = paras_1 # Just put a and b of the negative binomial distribution [0.7, 1.1]
        alpha = -1 +2*A/B
        #print('alpha : ' + str(alpha))

        #1/ Generate log-population at initial time from steady-state distribution + GBM diffusion trajectories for 2 years
        x_i_LB, x_f_LB, Prop_Matrix_LB, p_ext_LB, results_extinction_LB, time_vec_LB, results_extinction_source_LB, x_source_LB = _generator_diffusion_LB(A, B, N_0, t)

        #x_i_LB, x_f_LB, Prop_Matrix, p_ext, results_extinction = generator_diffusion_LB(B, A, N_0, t)
        N_cells_day_0_LB, N_cells_day_1_LB = np.sum(np.exp(x_i_LB)), np.sum(np.exp(x_f_LB)) + np.sum(np.exp(x_source_LB)) #N_cells_final_LB
        print('NUMBER OF CELLS AT INITIAL TIME')
        print(N_cells_day_0_LB)

        print('NUMBER OF CELLS AT FINAL TIME')
        print(N_cells_day_1_LB)

        #print('SHAPE_X_I ' + str(np.shape(x_i_LB)))
        #print('SHAPE_X_F ' + str(np.shape(x_f_LB)))

        if method == 'negative_binomial':

            df_diffusion_LB = _experimental_sampling_diffusion_NegBin(NreadsI, NreadsII, paras, x_i_LB, x_f_LB, N_cells_day_0_LB, N_cells_day_1_LB)
            # the file has a .csv extension but is tab-separated
            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')

        elif method == 'poisson':

            df_diffusion_LB = _experimental_sampling_diffusion_Poisson(NreadsI, NreadsII, x_i_LB, x_f_LB, t, N_cells_day_0_LB, N_cells_day_1_LB)
            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')
Generate in-silico RepSeq samples taken t_ime years apart.
Parameters
- paras_1 (numpy array): parameters of the noise model that has been learnt at time_1
- paras_2 (numpy array): parameters of the noise model that has been learnt at time_2
- method (str): 'negative_binomial' or 'poisson'
- tau (float): first time-scale parameter of the dynamics
- theta (float): second time-scale parameter of the dynamics
- t_ime (float): number of years between the two synthetic samplings (time_1 and time_2)
- filename (str): name of the file in which the dataframe is stored
Returns
- data-frame - csv file: the output is a tab-separated file written to filename + '.csv', with columns 'Clone_count_1' (at time_1), 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'
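Two details of the source listing are easy to miss when using the output: the file is written with a tab separator despite its .csv extension, and the exponent alpha is computed from the two time scales as alpha = -1 + 2*A/B with A = -1/tau and B = 1/theta. The sketch below illustrates both under stated assumptions: dynamics_exponent and load_trajectories are hypothetical helper names, not part of NoisET, and the small demonstration table is made up to show the round trip.

```python
import io

import pandas as pd


def dynamics_exponent(tau, theta):
    """Exponent alpha = -1 + 2*A/B, with A = -1/tau and B = 1/theta."""
    A = -1.0 / tau
    B = 1.0 / theta
    return -1.0 + 2.0 * A / B


def load_trajectories(path_or_buffer):
    """Read a table written with DataFrame.to_csv(..., sep='\\t').

    The first column is the dataframe index that to_csv writes by default.
    """
    return pd.read_csv(path_or_buffer, sep="\t", index_col=0)


# Round trip on a made-up table with the count columns named in the docstring.
demo = pd.DataFrame({"Clone_count_1": [3, 0, 5], "Clone_count_2": [1, 2, 0]})
buffer = io.StringIO()
demo.to_csv(buffer, sep="\t")
buffer.seek(0)
loaded = load_trajectories(buffer)

alpha = dynamics_exponent(tau=2.0, theta=1.0)
```

Reading the file with the default comma separator would collapse every row into a single column, so passing sep="\t" explicitly is the safer habit.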