noisets.noisettes

NoisET* NOIse sampling learning & Expansion detection of T-cell receptors using Bayesian inference.

High-throughput sequencing of T- and B-cell receptors makes it possible to track immune repertoires across time, in different tissues, in acute and chronic diseases or in healthy individuals. However, quantitative comparison between repertoires is confounded by variability in the read count of each receptor clonotype due to sampling, library preparation, and expression noise. We present an easy-to-use Python package, NoisET, that implements and generalizes a Bayesian method previously developed in Puelma Touzel et al, 2020 (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007873&rev=2). It can be used to learn experimental noise models for repertoire sequencing from replicates, and to detect responding clones following a stimulus. The package was tested on different repertoire sequencing technologies and datasets. The NoisET package is described at https://arxiv.org/abs/2102.03568.
* NoisET should be pronounced as "noisettes" (i.e. hazelnuts in French).
Functions library for NoisET - construction of noisettes package Copyright (C) 2021 Meriem Bensouda Koraichi. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Installation

NoisET runs on Python 3 (3.6). It is available on PyPI and can be downloaded and installed through pip:

$ pip install noisets

Note that data pre-processing, diversity estimates and generation of neutral TCR clonal dynamics are not yet available when NoisET is installed with pip; for these features, use only the setup.py command below. To install NoisET and try the tutorial displayed in this GitHub repository, clone the repository (git clone) into your working environment. Then, in the terminal, go to the NoisET directory and run the following command:

$ sudo python setup.py install

NoisET relies on the following Python libraries: numpy, pandas, matplotlib, seaborn, scipy and scikit-learn. If any of them is missing, first try to install the dependencies separately with the following commands:

python -m pip install -U pip
python -m pip install -U matplotlib
pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install -U scikit-learn
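
Before installing anything, it can help to see which of the libraries listed above are already importable. The following is a minimal sketch and not part of NoisET: the helper name `missing_modules` is hypothetical, and the only assumption is that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the libraries listed above; scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "seaborn", "scipy", "sklearn"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the entries of `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing libraries:", missing if missing else "none")
```

Only the libraries reported as missing then need one of the pip commands above.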

Documentation

Command lines with terminal

A tutorial is available at https://github.com/mbensouda/NoisET_tutorial . Three commands are available:

  • noiset-noise To infer the null noise model: NoisET first function (1)
  • noiset-nullgenerator To qualitatively check the consistency of NoisET first function
  • noiset-detection To detect clones responding to a stimulus: NoisET second function (2)

All options are described by typing one of the previous commands followed by --help or -h. Options are also described in the following README.

1/ Infer noise model

To infer the null noise model (NoisET first function (1)), use the command noiset-noise. At the command prompt, type:

$ noiset-noise --path 'DATA_REPO/' --f1 'FILENAME1_X_REP1' --f2 'FILENAME2_X_REP2' --(noisemodel)

Several options are needed to learn the noise model from two replicate samples associated with one individual at a specific time point:

1/ Data information:

  • --path 'PATHTODATA': set path to data file
  • --f1 'FILENAME1_X_REP1': filename for individual X replicate 1
  • --f2 'FILENAME2_X_REP2': filename for individual X replicate 2

If the features of your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal CDR3 nucleotide sequences and clonal CDR3 amino acid sequences) have column names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names directly by using:

  • --specify
  • --freq 'frequency': column label associated with the clonal fraction
  • --counts 'counts': column label associated with the clonal count
  • --ntCDR3 'ntCDR3': column label associated with the clonal CDR3 nucleotide sequence
  • --AACDR3 'AACDR3': column label associated with the clonal CDR3 amino acid sequence
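
To decide whether --specify is needed, one can inspect the header of a data file before running the command. The sketch below is an illustration only: the helper `columns_missing_defaults` is hypothetical, the toy table is made-up data reusing the column labels of the --specify example, and it assumes tab-separated files (a real file would be read with `pd.read_csv(path, sep="\t")`).

```python
import io

import pandas as pd

# Default column labels expected when --specify is not used (taken from the text above).
DEFAULT_COLUMNS = ["Clone fraction", "Clone count", "N. Seq. CDR3", "AA. Seq. CDR3"]


def columns_missing_defaults(table):
    """Return the default column labels that are absent from `table`."""
    return [label for label in DEFAULT_COLUMNS if label not in table.columns]


# Toy tab-separated table (hypothetical data) standing in for a real RepSeq file.
toy_file = io.StringIO(
    "nucleotide\taminoAcid\tcount\tfrequencyCount\n"
    "TGTGCC\tCA\t12\t0.6\n"
    "TGTAGT\tCS\t8\t0.4\n"
)
table = pd.read_csv(toy_file, sep="\t")

# A non-empty list means the labels must be passed with --specify and the four label options.
missing = columns_missing_defaults(table)
print(missing)
```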

2/ Choice of noise model: (parameters meaning described in Methods section)

  • --NBPoisson: Negative Binomial + Poisson Noise Model - 5 parameters
  • --NB: Negative Binomial - 4 parameters
  • --Poisson: Poisson - 2 parameters

3/ Example:

At the command prompt, type:

$ noiset-noise --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_0_F2_.txt.gz' --NB

This command line learns the four parameters of the negative binomial null noise model --NB for individual Q1 at day 0. A '.txt' file is created in the working directory: it contains 5, 4 or 2 parameters depending on whether the NBP, NB or Poisson noise model was chosen. In this example, it is a four-parameter table (already created in the data_examples repository). You can run the previous examples using the data (Q1 day 0 / day 15) provided in the data_examples folder; the data come from Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins, Pogorelyy et al, PNAS (https://www.pnas.org/content/115/50/12704).

4/ Example with --specify:

At the command prompt, type:

$ noiset-noise --path 'data_examples/' --f1 'replicate_1_1.tsv.gz' --f2 'replicate_1_2.tsv.gz' --specify --freq 'frequencyCount' --counts 'count' --ntCDR3 'nucleotide' --AACDR3 'aminoAcid' --NB

As previously, this command learns the four parameters of the negative binomial null noise model --NB, here for one individual of the cohort produced in Model to improve specificity for identification of clinically-relevant expanded T cells in peripheral blood, Rytlewski et al, PLOS ONE (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213684).

2/ Generate synthetic data from null model learning:

To qualitatively check the consistency of NoisET first function (1) with experiments, or for other reasons, it can be useful to generate synthetic replicates from the null model (described in Methods section). One can also generate the dynamics of healthy RepSeq samples using the noise model learned in the first step and giving the time scale of the turnover dynamics of the repertoire, as defined in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. See the example notebook at https://github.com/statbiophys/NoisET/blob/master/NoisET%20example%20-%20Null%20model%20learning%20.ipynb.
To generate synthetic TCR RepSeq data replicates with chosen sampling noise characteristics, use the command noiset-nullgenerator:

$ noiset-nullgenerator --(noise-model) --nullpara 'NULLPARAS' --NreadsI float --NreadsII float --Nclones float --output 'SYNTHETICDATA' 

1/ Choice of noise model:

The user must choose one of the three possible models for the probability P(n|f) that a TCR has an empirical count n knowing that its true frequency is f: a Poisson distribution --Poisson, a negative binomial distribution --NB, or a two-step model combining a negative binomial and a Poisson distribution --NBPoisson. n is the empirical clone size and depends on the experimental protocol. For each P(n|f), a set of parameters is learned.

  • --NBPoisson: Negative Binomial + Poisson Noise Model - 5 parameters, described in Puelma Touzel et al, 2020: the power-law exponent of the clonotype frequency distribution 'alph_rho'; the minimum of the clonotype frequency distribution 'fmin'; 'beta' and 'alpha', the parameters of the negative binomial distribution constraining the mean and variance of the P(m|f) distribution (m being the number of cells associated with a clonotype in the experimental sample); and 'm_total', the total number of cells in the sample of interest.
  • --NB: Negative Binomial - 4 parameters: the power-law exponent of the clonotype frequency distribution 'alph_rho' (same ansatz as in Puelma Touzel et al, 2020); the minimum of the clonotype frequency distribution 'fmin'; and 'beta' and 'alpha', the parameters of the negative binomial distribution constraining the mean and variance of the P(n|f) distribution: NB(f·Nreads, f·Nreads + beta·(f·Nreads)^alpha), where the first argument is the mean and the second the variance. (Nreads is the total number of reads in the sample of interest.)
  • --Poisson: Poisson - 2 parameters: the power-law exponent of the clonotype frequency distribution 'alph_rho' (same ansatz as in Puelma Touzel et al, 2020) and the minimum of the clonotype frequency distribution 'fmin'. P(n|f) is a Poisson distribution of parameter f·Nreads. (Nreads is the total number of reads in the sample of interest.)
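
The Poisson and negative binomial forms of P(n|f) described above can be sketched with scipy. This is an illustration under stated assumptions, not the NoisET implementation: the function names are hypothetical, the numerical values are arbitrary, the variance is taken as mean + beta·mean^alpha with mean = f·Nreads, and the two-step --NBPoisson model is left out.

```python
import numpy as np
from scipy.stats import nbinom, poisson


def poisson_pmf(n, f, n_reads):
    """P(n|f) for the Poisson model with mean f * n_reads."""
    return poisson.pmf(n, f * n_reads)


def negbin_pmf(n, f, n_reads, beta, alpha):
    """P(n|f) for a negative binomial with mean m = f * n_reads and variance m + beta * m**alpha."""
    mean = f * n_reads
    var = mean + beta * mean**alpha
    # Convert (mean, variance) to scipy's (number of successes, success probability).
    p = mean / var
    r = mean**2 / (var - mean)
    return nbinom.pmf(n, r, p)


# Arbitrary illustrative values, not learned parameters.
f, n_reads, beta, alpha = 1e-4, 1_000_000, 0.5, 1.2
n = np.arange(0, 2000)
p_pois = poisson_pmf(n, f, n_reads)
p_nb = negbin_pmf(n, f, n_reads, beta, alpha)

# Both models share the same mean; the negative binomial is over-dispersed.
mean_nb = np.sum(n * p_nb)
var_nb = np.sum((n - mean_nb) ** 2 * p_nb)
var_pois = np.sum((n - f * n_reads) ** 2 * p_pois)
print(mean_nb, var_nb, var_pois)
```

The conversion relies on scipy's parameterization of `nbinom`, whose mean is r(1-p)/p and whose variance is r(1-p)/p².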

2/ Specify learned noise parameters:

  • --nullpara 'PATHTOFOLDER/NULLPARAS.txt': parameters learned with NoisET function (1). !!! Make sure the noise model matches the content of the null parameter file: 5 parameters for --NBPoisson, 4 parameters for --NB and 2 parameters for --Poisson.

3/ Sequencing properties of data:

  • --NreadsI NNNN: total number of reads in the first replicate - it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F1_.txt.gz'.
  • --NreadsII NNNN: total number of reads in the second replicate - it should match the actual data. In the example below, it is the sum of 'Clone count' in 'Q1_0_F2_.txt.gz'.
  • --Nclones NNNN: total number of clones in the union of the two replicates - it should match the actual data. In the example below, it is the number of distinct clones found across the two replicates 'Q1_0_F1_.txt.gz' and 'Q1_0_F2_.txt.gz'.
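
The three values above can be computed from the replicate tables with pandas. The following is a sketch on made-up toy data: the helper `sequencing_properties` is hypothetical, and it assumes that a clone is identified by its CDR3 nucleotide sequence in the default 'N. Seq. CDR3' column.

```python
import pandas as pd

# Toy replicates (hypothetical data); real tables would be loaded with pd.read_csv.
rep1 = pd.DataFrame({"N. Seq. CDR3": ["AAA", "CCC", "GGG"], "Clone count": [5, 3, 1]})
rep2 = pd.DataFrame({"N. Seq. CDR3": ["AAA", "GGG", "TTT"], "Clone count": [4, 2, 2]})


def sequencing_properties(rep1, rep2, seq_col="N. Seq. CDR3", count_col="Clone count"):
    """Return (reads in replicate 1, reads in replicate 2, clones in the union of both)."""
    n_reads_1 = int(rep1[count_col].sum())
    n_reads_2 = int(rep2[count_col].sum())
    # Union of the clone identifiers seen in either replicate.
    n_clones = len(set(rep1[seq_col]) | set(rep2[seq_col]))
    return n_reads_1, n_reads_2, n_clones


n_reads_1, n_reads_2, n_clones = sequencing_properties(rep1, rep2)
print(n_reads_1, n_reads_2, n_clones)
```

The three numbers correspond to the values passed to --NreadsI, --NreadsII and --Nclones.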

4/ Output file

  • --output 'SYNTHETICDATA': name of the output file where you can find the synthetic data set.

At the command prompt, type:

$ noiset-nullgenerator --NB --nullpara 'data_examples/nullpara1.txt' --NreadsI 829578 --NreadsII 954389 --Nclones 776247 --output 'test' 

Running this line creates a 'synthetic_test.csv' file with four columns: 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1', 'Clone_fraction_2', respectively the synthetic read counts and frequencies that you would have found in an experimental sample with the same learned parameters 'nullpara1.txt', 'NreadsI', 'NreadsII' and 'Nclones'.
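
To make the idea of synthetic replicates concrete, here is a simplified sketch for the Poisson model only. It is not the code behind noiset-nullgenerator: the function name and parameter values are hypothetical, frequencies are drawn independently from a power law on [fmin, 1] without enforcing that they sum to one, and clones unseen in both replicates are dropped.

```python
import numpy as np
import pandas as pd


def synthetic_poisson_replicates(alph_rho, fmin, n_reads_1, n_reads_2, n_clones, seed=0):
    """Draw clone frequencies from rho(f) ~ f**alph_rho on [fmin, 1], then Poisson read counts."""
    rng = np.random.default_rng(seed)
    # Inverse-CDF sampling of the power law (requires alph_rho != -1).
    k = alph_rho + 1.0
    u = rng.uniform(size=n_clones)
    f = (fmin**k + u * (1.0 - fmin**k)) ** (1.0 / k)
    # Each replicate is an independent Poisson sample around f * Nreads.
    n1 = rng.poisson(f * n_reads_1)
    n2 = rng.poisson(f * n_reads_2)
    # Keep only clones observed in at least one replicate.
    seen = (n1 + n2) > 0
    n1, n2 = n1[seen], n2[seen]
    return pd.DataFrame({
        "Clone_count_1": n1,
        "Clone_count_2": n2,
        "Clone_fraction_1": n1 / n1.sum(),
        "Clone_fraction_2": n2 / n2.sum(),
    })


# Arbitrary illustrative parameters, not values learned from data.
synthetic = synthetic_poisson_replicates(
    alph_rho=-2.0, fmin=1e-6, n_reads_1=100_000, n_reads_2=120_000, n_clones=50_000
)
print(synthetic.head())
```

The column names mirror those of the 'synthetic_test.csv' file described above, so the toy table can be plotted in the same way as the real output.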

3/ Detect responding clones:

NoisET second function (2) detects clones responding to a stimulus. To detect responding clones from two RepSeq data sets at time_1 and time_2, use the command noiset-detection:

$ noiset-detection --(noisemodel)  --nullpara1 'FILEFORPARAS1' --nullpara2 'FILEFORPARAS2' --path 'REPO/' --f1 'FILENAME_TIME_1' --f2 'FILENAME_TIME_2' --pval float --smedthresh float --output 'DETECTIONDATA'

Several options are needed to detect responding clones from two samples of one individual taken at two time points:

1/ Choice of noise model:

  • --NBPoisson: Negative Binomial + Poisson Noise Model - 5 parameters
  • --NB: Negative Binomial - 4 parameters
  • --Poisson: Poisson - 2 parameters

2/ Specify learned parameters for both time points:

(they can be the same for both time points if replicates are not available, but this should be used carefully, as mentioned in [ARTICLE])

  • --nullpara1 'PATH/FOLDER/NULLPARAS1.txt': parameters learned with NoisET function (1) for time 1
  • --nullpara2 'PATH/FOLDER/NULLPARAS2.txt': parameters learned with NoisET function (1) for time 2

!!! Make sure the noise model matches the content of the null parameter files: 5 parameters for --NBPoisson, 4 parameters for --NB and 2 parameters for --Poisson.

3/ Data information:

  • --path 'PATHTODATA': set path to data file
  • --f1 'FILENAME1_X_time1': filename for individual X time 1
  • --f2 'FILENAME2_X_time2': filename for individual X time 2

If the features of your TCR CDR3 clonal populations (i.e. clonal fractions, clonal counts, clonal CDR3 nucleotide sequences and clonal CDR3 amino acid sequences) have column names other than ('Clone fraction', 'Clone count', 'N. Seq. CDR3', 'AA. Seq. CDR3'), you can specify the names by using:

  • --specify
  • --freq 'frequency': column label associated with the clonal fraction
  • --counts 'counts': column label associated with the clonal count
  • --ntCDR3 'ntCDR3': column label associated with the clonal CDR3 nucleotide sequence
  • --AACDR3 'AACDR3': column label associated with the clonal CDR3 amino acid sequence

4/ Detection thresholds: (More details in Methods section).

  • --pval XXX: p-value threshold for the expansion/contraction - use 0.05 as a default value.
  • --smedthresh XXX: threshold on the median log-fold change for the expansion/contraction - use 0 as a default value.
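
One way to picture how the two thresholds act together is a simple filter on a per-clone statistics table. The sketch below is an assumption-laden illustration, not the NoisET detection code: the column names 'p_value' and 's_median', the helper `select_expanded` and the toy values are all hypothetical, and only expansion (not contraction) is shown.

```python
import pandas as pd

# Hypothetical per-clone statistics; 'p_value' and 's_median' are placeholder
# column names, not the actual headers of the NoisET output file.
clone_stats = pd.DataFrame({
    "clone": ["c1", "c2", "c3", "c4"],
    "p_value": [0.001, 0.20, 0.03, 0.04],
    "s_median": [2.1, 1.5, -0.8, 0.0],
})


def select_expanded(table, pval=0.05, smedthresh=0.0):
    """Keep clones whose p-value is below `pval` and whose median log-fold change exceeds `smedthresh`."""
    keep = (table["p_value"] < pval) & (table["s_median"] > smedthresh)
    return table[keep]


expanded = select_expanded(clone_stats)
print(expanded)
```

With the default values 0.05 and 0, only the toy clone with a small p-value and a positive median log-fold change is kept.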

5/ Output file

  • --output 'DETECTIONDATA': name of the output file (.csv) where you can find a list of the putative responding clones with their statistical features. (More details in Methods section).

At the command prompt, type:

$ noiset-detection --NB  --nullpara1 'data_examples/nullpara1.txt' --nullpara2 'data_examples/nullpara1.txt' --path 'data_examples/' --f1 'Q1_0_F1_.txt.gz' --f2 'Q1_15_F1_.txt.gz' --pval 0.05 --smedthresh 0 --output 'detection' 

Output: a table containing all putative detected clones with statistical features of the log-fold change variable s; for a more theoretical description, see Puelma Touzel et al, 2020.

Python package


# Import python libraries
import os
import time
import math
from copy import deepcopy
from decimal import Decimal
from functools import partial

import matplotlib.pyplot as plt
from matplotlib import cm, colors, colorbar
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from scipy.stats import nbinom
from scipy.stats import poisson
from scipy.stats import rv_discrete
from datetime import datetime, date
from scipy.optimize import minimize

# Tools for PCA
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Tools to generate RepSeq trajectories
import shutil
from multiprocessing import Pool, cpu_count

###===================================TOOLS-TO-GENERATE-NEUTRAL-TCR-REP-SEQ-TRAJECTORIES=====================================================
#  Library functions to generate TCR repertoires
##------------------------Initial-Distributions------------------------
def _rho_counts_theo_minus_x(A, B, N_0):
    # Generates the TCR clonal frequency distribution from the theoretical model described in
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # This function is not meant to be called by a NoisET user.
    # The log-count space between Cmin and N_0 is discretized into nbins_1 = 100000 points.

    Cmin = 1
    freq_dtype = 'float32'

    N_cells = int(1e10)
    S_c = -(A+B/2)*N_cells/(N_0-1)

    alpha = -2*A/B

    nbins_1 = 100000

    logcountvec = np.linspace(np.log10(Cmin), np.log10(N_0), nbins_1)
    log_countvec_minus = np.array(np.log(np.power(10, logcountvec)), dtype=freq_dtype).flatten()
    log_rho_minus = np.log(-(S_c/A)) + np.log(1-np.exp(-alpha*log_countvec_minus))

    N_clones_1 = -(S_c/A)*(np.log(N_0) - (1/alpha)*(1 - N_0**(-alpha)))

    return log_rho_minus, log_countvec_minus, N_clones_1

def _rho_counts_theo_plus_x(A, B, N_0):
    # Generates the TCR clonal frequency distribution from the theoretical model described in
    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    # This function is not meant to be called by a NoisET user.
    # The log-count space between N_0 and Cmax is discretized into nbins_2 = 100000 points;
    # a finer discretization than for the minus distribution can be used here.

    Cmax = int(1e10)
    #Cmax = np.inf
    freq_dtype = 'float32'

    N_cells = int(1e10)
    S_c = -(A+B/2)*N_cells/(N_0 -1)

    alpha = -2*A/B

    nbins_2 = 100000

    logcountvec = np.linspace(np.log10(N_0), np.log10(Cmax), nbins_2)
    log_countvec_plus = np.array(np.log(np.power(10, logcountvec)), dtype=freq_dtype).flatten()
    log_rho_plus = np.log(N_0**alpha-1) + np.log(-(S_c/A)) - (alpha)*log_countvec_plus

    N_clones_2 = -(S_c/(A*alpha))*(1 - N_0**(-alpha))

    return log_rho_plus, log_countvec_plus, N_clones_2


def _get_distsample(pmf, Nsamp, dtype='uint32'):
    '''
    Generates Nsamp index samples of dtype (e.g. uint16 handles up to 65535 indices) from the discrete probability mass function pmf.
    Handles multi-dimensional domains. N.B. Output is sorted.

    This function was written to generate TCR clonal frequency distributions from the theoretical model described in
    https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    It is not intended for NoisET users.
    '''
    #assert np.sum(pmf)==1, "pmf not normalized!"

    shape = np.shape(pmf)
    sortindex = np.argsort(pmf, axis=None)  # uses flattened array
    pmf = pmf.flatten()
    pmf = pmf[sortindex]
    cmf = np.cumsum(pmf)
    #print('cumulative distribution is equal to: ' + str(cmf[-1]))
    # Inverse-CDF sampling: draw uniforms on [0, cmf[-1]) and look up the matching bin
    choice = np.random.uniform(high = cmf[-1], size = int(float(Nsamp)))
    index = np.searchsorted(cmf, choice)
    # Map back from sorted positions to the original (flattened, then multi-dimensional) indices
    index = sortindex[index]
    index = np.unravel_index(index, shape)
    index = np.transpose(np.vstack(index))
    sampled_inds = np.array(index[np.argsort(index[:,0])], dtype=dtype)
    return sampled_inds
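
# The inverse-CDF sampling idea behind _get_distsample can be illustrated on a 1-D case.
# This is a minimal sketch, not the routine used by the package: the helper name
# sample_indices_from_pmf and the use of np.random.default_rng are assumptions made for
# this example, and side='right' is chosen here so that zero-probability bins are never drawn.

```python
import numpy as np

def sample_indices_from_pmf(pmf, n_samples, rng=None):
    """Draw n_samples indices from a (possibly unnormalized) 1-D pmf by inverse-CDF lookup."""
    rng = np.random.default_rng() if rng is None else rng
    cmf = np.cumsum(np.asarray(pmf, dtype=float))
    # Uniform draws on [0, cmf[-1]) so that the pmf does not need to be normalized
    u = rng.uniform(0.0, cmf[-1], size=n_samples)
    # Each draw falls in exactly one bin of the cumulative distribution
    return np.sort(np.searchsorted(cmf, u, side='right'))

# Toy pmf with weights 0 : 1 : 3 (hypothetical values)
samples = sample_indices_from_pmf([0.0, 1.0, 3.0], 10000, rng=np.random.default_rng(0))
freqs = np.bincount(samples, minlength=3) / samples.size
```

# The empirical frequencies in freqs should be close to the normalized weights (0, 0.25, 0.75).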
 271
 272##------------------------Propagator------------------------
 273def _gaussian_matrix(x_vec, x_i_vec_unique, A, B, t):
 274
 275    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 276    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 277    # this function is not made for a NoisET user
 278    x_vec_reshaped = np.reshape(x_vec, (len(x_vec), 1))
 279    ones_vec = np.ones((len(x_i_vec_unique), 1))
 280    M = np.multiply(ones_vec, x_vec_reshaped.T)
 281    x_i_unique_reshaped = np.reshape(x_i_vec_unique, (len(x_i_vec_unique), 1))
 282    
 283    return (1/np.sqrt(2*np.pi*B*t))*np.exp((-1/(2*B*t))*(M - x_i_unique_reshaped - A*t)**2)
 284
 285def _gaussian_adsorption_matrix(x_vec, x_i_vec_unique, A, B, t):
 286
 287    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 288    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 289    # this function is not made for a NoisET user
 290    a = 0
 291    gauss = _gaussian_matrix(x_vec, x_i_vec_unique, A, B, t)
 292    gauss_a = _gaussian_matrix(x_vec, 2*a-x_i_vec_unique, A, B, t)
 293    x_i_unique_reshaped = np.reshape(x_i_vec_unique, (len(x_i_vec_unique), 1))
 294    return gauss - np.exp((A*(a-x_i_unique_reshaped))/(B/2)) * gauss_a
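
# The propagator above follows the method of images: the free drift-diffusion Gaussian minus a
# weighted Gaussian started at the mirror point 2a - x_i, so that the density vanishes at the
# absorbing boundary x = a. The scalar sketch below checks this property numerically and
# computes a survival probability; the function name absorbed_propagator and the parameter
# values are assumptions chosen for illustration only.

```python
import numpy as np

def absorbed_propagator(x, x0, A, B, t, a=0.0):
    """Drift-diffusion propagator on x > a with an absorbing boundary at x = a."""
    def gauss(start):
        return np.exp(-(x - start - A * t) ** 2 / (2 * B * t)) / np.sqrt(2 * np.pi * B * t)
    # Free propagator minus the weighted image started at the mirror point 2a - x0
    return gauss(x0) - np.exp(2 * A * (a - x0) / B) * gauss(2 * a - x0)

A, B, t, x0 = -0.5, 1.0, 2.0, 1.5  # hypothetical drift, diffusion, time and start point
boundary_value = absorbed_propagator(0.0, x0, A, B, t)

# Survival probability: trapezoidal integral of the propagator over x > 0
x = np.linspace(0.0, 15.0, 20001)
p = absorbed_propagator(x, x0, A, B, t)
survival = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(x))
extinction = 1.0 - survival
```

# boundary_value is zero up to rounding, and extinction plays the role of p_ext in _extinction_vector.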
 295
 296def _extinction_vector(x_i, A, B, t): 
 297
 298    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 299    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 300    # this function is not made for a NoisET user
 301    nbins = 2000
 302    eps = 1e-20
 303    #eps = 0
 304    x_vec = np.linspace(eps, np.max(x_i) - A*t + 3*np.sqrt(B*t), nbins)
 305    
 306    x_i_sorted = np.sort(x_i)
 307    
 308    xiind_vals, xi_start_ind, xi_counts=np.unique(x_i_sorted, return_counts=True,return_index=True)
 309    Prop_Matrix = _gaussian_adsorption_matrix(x_vec, xiind_vals, A, B, t)
 310    
 311    dx =np.asarray(np.diff(x_vec)/2., dtype='float32')
 312    integ = np.sum(dx*(Prop_Matrix[:, 1:]+Prop_Matrix[:, :-1]), axis = 1)
 313    p_ext = 1 - integ
 314    
 315    p_ext_new = np.zeros((len(x_i)))
 316    for it,xiind in enumerate(xiind_vals):
 317        p_ext_new[xi_start_ind[it]:xi_start_ind[it]+xi_counts[it]] = p_ext[it]
 318        
 319    test = np.random.uniform(0,1, size = (len(p_ext_new))) > p_ext_new
 320    results_extinction = test.astype(int)
 321    
 322    return results_extinction, Prop_Matrix, x_vec, xiind_vals, xi_start_ind, xi_counts, p_ext
 323
 324#------------------------Source-term-no-frequency-dependency------------------------
 325
 326def _gaussian_matrix_time(x_vec, x_i_scal, A, B, tvec_unique):
 327
 328    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 329    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 330    # this function is not made for a NoisET user
 331    
 332    x_vec_reshaped = np.reshape(x_vec, (len(x_vec), 1))
 333    ones_vec = np.ones((len(tvec_unique), 1))
 334    M = np.multiply(ones_vec, x_vec_reshaped.T)
 335    tvec_unique_reshaped = np.reshape(tvec_unique, (len(tvec_unique), 1))
 336    #x_i_unique_reshaped = np.reshape(x_i_vec_unique, (len(x_i_vec_unique), 1))
 337    
 338    return (1/np.sqrt(2*np.pi*B*tvec_unique_reshaped))*np.exp((-1/(2*B*tvec_unique_reshaped))*(M - x_i_scal - A*tvec_unique_reshaped)**2)
 339
 340def _gaussian_adsorption_matrix_time(x_vec, x_i_scal, A, B, tvec_unique):
 341
 342    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 343    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 344    # this function is not made for a NoisET user
 345    
 346    a = 0
 347    gauss = _gaussian_matrix_time(x_vec, x_i_scal, A, B, tvec_unique)
 348    gauss_a = _gaussian_matrix_time(x_vec, 2*a-x_i_scal, A, B, tvec_unique)
 349    
 350    return gauss - np.exp((A*(a-x_i_scal))/(B/2)) * gauss_a
 351
 352def _Prop_Matrix_source( A, B, tvec): 
 353    
 354    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 355    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 356    # this function is not made for a NoisET user
 357
 358    nbins = 2000
 359    N_0 = 40
 360    x_i_scal = np.log(N_0)
 361    t = np.max(tvec)
 362    x_vec = np.linspace(0, x_i_scal - A*t + 2*np.sqrt(B*t), nbins)
 363    
 364    tvec_sorted = np.sort(tvec)
 365    
 366    tiind_vals, ti_start_ind, ti_counts=np.unique(tvec_sorted, return_counts=True,return_index=True)
 367    Prop_Matrix = _gaussian_adsorption_matrix_time(x_vec, x_i_scal, A, B, tiind_vals)
 368    
 369    dx =np.asarray(np.diff(x_vec)/2., dtype='float32')
 370    integ = np.sum(dx*(Prop_Matrix[:, 1:]+Prop_Matrix[:, :-1]), axis = 1)
 371    
 372    return Prop_Matrix, x_vec, tiind_vals, ti_start_ind, ti_counts, integ
 373
 374def _extinction_vector_source(A, B, tvec): 
 375
 376    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 377    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 378    # this function is not made for a NoisET user
 379    
 380    nbins = 2000
 381    N_0 = 40
 382    x_i_scal = np.log(N_0)
 383    t = np.max(tvec)
 384    x_vec = np.linspace(0, x_i_scal - A*t + 2*np.sqrt(B*t), nbins)
 385    
 386    tvec_sorted = np.sort(tvec)
 387    
 388    tiind_vals, ti_start_ind, ti_counts=np.unique(tvec_sorted, return_counts=True,return_index=True)
 389    Prop_Matrix = _gaussian_adsorption_matrix_time(x_vec, x_i_scal, A, B, tiind_vals)
 390    
 391    dx =np.asarray(np.diff(x_vec)/2., dtype='float32')
 392    integ = np.sum(dx*(Prop_Matrix[:, 1:]+Prop_Matrix[:, :-1]), axis = 1)
 393    p_ext = 1 - integ
 394    
 395    p_ext_new = np.zeros((len(tvec)))
 396    for it,tiind in enumerate(tiind_vals):
 397        p_ext_new[ti_start_ind[it]:ti_start_ind[it]+ti_counts[it]] = p_ext[it]
 398        
 399    test = np.random.uniform(0,1, size = (len(p_ext_new))) > p_ext_new
 400    results_extinction = test.astype(int)
 401    
 402
 403    return results_extinction, Prop_Matrix, x_vec, tiind_vals, ti_start_ind, ti_counts, p_ext
 404
 405##------------------------Function-to-generate-in-silico-Rep-Seq-samples------------------------
 406
 407def _generator_diffusion_LB(A, B, N_0, t):
 408    
 409    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 410    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 411    # this function is not made for a NoisET user
 412    
 413    eps = 1e-20
 414    
 415    ## Choose initial size of the immune system to be 1e10 (for a mouse)
 416    N_cells = int(1e10)
 417    
 418    #Parameters for the repertoire generation
 419    alpha_rho = -1 + 2*A/B
 420    N_ext = 1
 421    freq_dtype = 'float32' 
 422    
 423    #==========================generate the steady state distribution===============================
 424    
 425    #for counts < N0: 
 426    logrhofvec,logfvec, N_clones_1 = _rho_counts_theo_minus_x(A, B, N_0)
 427    dlogfby2=np.asarray(np.diff(logfvec)/2., dtype='float32')
 428    integ=np.exp(logrhofvec[np.newaxis,:])
 429    f_samples_inds=_get_distsample(np.asarray((dlogfby2[np.newaxis,:]*(integ[:,1:]+integ[:,:-1])).flatten(),dtype='float32'), N_clones_1,dtype='uint32').flatten()
 430    #print("generation population smaller than N_0: check")
 431    
 432    logcvec_generated = logfvec[f_samples_inds]
 433    counts_generated = np.exp(logcvec_generated)
 434    C_f_minus = np.sum(counts_generated)
 435    print(str(C_f_minus) + ' cells smaller than N_0')
 436    log_cminus_generated = logcvec_generated
 437    logrhofvec_1,logfvec_1 = logrhofvec,logfvec
 438    print(str(N_clones_1) + ' N_clones_1')
 439    
 440    #for counts > N0:
 441    
 442    logrhofvec,logfvec, N_clones_2 = _rho_counts_theo_plus_x(A, B, N_0)
 443    dlogfby2=np.asarray(np.diff(logfvec)/2., dtype='float32')
 444    integ=np.exp(logrhofvec[np.newaxis,:])
 445    f_samples_inds=_get_distsample(np.asarray((dlogfby2[np.newaxis,:]*(integ[:,1:]+integ[:,:-1])).flatten(),dtype='float32'),N_clones_2,dtype='uint32').flatten()
 446    #print("generation population larger than N_0: check")
 447    logcvec_generated = logfvec[f_samples_inds]
 448    counts_generated = np.exp(logcvec_generated)
 449    C_f_plus = np.sum(counts_generated)
 450    print(str(C_f_plus) + ' N_cells larger than N_0')
 451    log_cplus_generated = logcvec_generated
 452    logrhofvec_2,logfvec_2 = logrhofvec,logfvec
 453    print(str(N_clones_2) + ' N_clones_2')
 454    
 455    #===================================================
 456    
 457    N_clones = int(N_clones_1 + N_clones_2)
 458    print('N_clones= ' + str(N_clones))
 459
 460    S_c = - (A + B/2)*(N_cells/(N_0-1))
 461    print('N_clones_theory= ' + str(-(S_c/A)*np.log(N_0)))
 462    
 463    
 464    x_i = np.concatenate((log_cminus_generated, log_cplus_generated), axis = None)
 465    
 466    N_total_cells_generated = np.sum(np.exp(x_i))
 467    print("N_total_cells_generated/N_total_cells:" + str(N_total_cells_generated/N_cells))
 468    
 469    
 470    
 471    results_extinction, Prop_Matrix, x_vec, xiind_vals, xi_start_ind, xi_counts, p_ext = _extinction_vector(x_i, A, B, t)
 472    #x_vec = np.linspace(0, 30*B*t, 2000)
 473    dx=np.asarray(np.diff(x_vec)/2., dtype='float32')
 474    
 475    x_i_noext= x_i[np.where(results_extinction ==1)]
 476    x_f = np.zeros((len(x_i)))
 477    
 478    for i in range(len(xiind_vals)): 
 479        
 480        
 481        if (np.dot(dx, Prop_Matrix[i,1:] + Prop_Matrix[i, :-1])) < 1e-7:
 482            pass
 483        
 484        else:
 485        
 486            Prop_adsorp = Prop_Matrix[i,:] / (np.dot(dx, Prop_Matrix[i,1:] + Prop_Matrix[i, :-1]))
 487
 488            integ = Prop_adsorp[np.newaxis,:]
 489            f_samples_inds = _get_distsample(np.asarray((dx[np.newaxis,:]*(integ[:,1:]+integ[:,:-1])).flatten(),dtype='float32'), xi_counts[i],dtype='uint32').flatten()
 490
 491            x_f[xi_start_ind[i]:xi_start_ind[i]+xi_counts[i]] = x_vec[f_samples_inds]
 492    
 493    x_f = np.multiply(x_f,results_extinction)
 494    
 495
 496    x_f[x_f == 0] = -np.inf
 497    
    # Clones with x_f == -inf are counted as extinct
    N_extinction = len(x_f[x_f == -np.inf])
    
    print('Number of extinction= ' + str(N_extinction))
    # The simulated percentage is divided by t so that it can be compared with the theoretical rate
    sim_ext = (N_extinction/len(results_extinction))*100/t
    theo_ext = (-A/np.log(N_0))*100
    print('simulations % of extinction= ' + str(sim_ext) + '%')
    print('theoretical % of extinction= ' + str(theo_ext) + '%')
 506        
 507
 508    #Source term
 509
 510    N_source = S_c*t
 511
 512    print('Number of insertions= ' +str(N_source))
 513
 514    N_source = int(N_source)
 515
 516    eps = 1e-8
 517    time_vec_span = np.linspace(eps, t, 5000)
 518    time_vec = np.random.choice(time_vec_span, N_source)
 519    time_vec = np.sort(time_vec)
 520    
 521    results_extinction_source, Prop_Matrix_source, x_vec_source, tiind_vals, ti_start_ind, ti_counts, p_ext_source = _extinction_vector_source(A, B, time_vec)
 522
 523    dx_source=np.asarray(np.diff(x_vec_source)/2., dtype='float32')
 524
 525    x_source_LB = np.zeros((N_source))
 526    for i in range(len(tiind_vals)): 
 527        
 528        if (np.dot(dx_source, Prop_Matrix_source[i,1:] + Prop_Matrix_source[i, :-1])) < 1e-7:
 529            pass
 530        
 531        else:
            # Normalize the propagator row so that it integrates to 1 (trapezoidal rule)
            Prop_adsorp_s = Prop_Matrix_source[i,:] / (np.dot(dx_source, Prop_Matrix_source[i,1:] + Prop_Matrix_source[i, :-1]))
 534
 535
 536            integ = Prop_adsorp_s[np.newaxis,:]
 537            f_samples_inds_s = _get_distsample(np.asarray((dx_source[np.newaxis,:]*(integ[:,1:]+integ[:,:-1])).flatten(),dtype='float32'), ti_counts[i],dtype='uint32').flatten()
 538
 539            x_source_LB[ti_start_ind[i]:ti_start_ind[i]+ti_counts[i]] = x_vec_source[f_samples_inds_s]
 540            
 541        
 542    x_source_LB = np.multiply(x_source_LB, results_extinction_source)
 543    
 544    x_source_LB[x_source_LB == 0] = -np.inf
 545
 546
 547    
 548    return x_i, x_f, Prop_Matrix, p_ext, results_extinction, time_vec, results_extinction_source, x_source_LB
 549
 550def _experimental_sampling_diffusion_Poisson(NreadsI, NreadsII, x_0, x_2, t, N_cell_0, N_cell_2):
 551
 552
 553    # This function has been made to generate TCR clonal frequencies distribution from the theoretical model described in the paper 
 554    # https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. 
 555    # this function is not made for a NoisET user
 556    
 557    
 558    #----------------------------Counts generation --------------------------------------------
 559    
 560    ##Initial condition
 561    N_total_0 = len(x_0[x_0 != -np.inf])
 562    x_0_bis = x_0[x_0 != -np.inf]
 563    
 564    print('Number of clones at initial time ' + str(N_total_0))
 565    
 566    N_total_2 = len(x_2[x_2 != -np.inf])
 567    x_2_bis = x_2[x_2 != -np.inf]
 568    
 569    
 570    print('Number of clones after ' + str(t) + ' year(s) ' +  str(N_total_2))
 571    
 572    #N_total = min(N_total_0, N_total_2)
 573    assert len(x_0) == len(x_2)
 574    N_total = len(x_0)
 575    
 576    x_2_final = x_2[:N_total]
 577    
 578
 579    f_vec_initial = np.exp(x_0)/N_cell_0
 580    m=float(NreadsI)*f_vec_initial
 581    n_counts_day_0 = np.random.poisson(m, size =(1, int(N_total)))
 582    n_counts_day_0 = n_counts_day_0[0,:]
 583    
 584    #print('done')
 585    
 586    #Final condition
 587    f_vec_end = np.exp(x_2_final)/N_cell_2
 588    m=float(NreadsII)*f_vec_end
 589    #print(m)
 590    print('MEAN N : ' + str(np.mean(m)))
 591    n_counts_day_1 = np.random.poisson(m, size =(1, int(N_total)))
 592    print(n_counts_day_1)
 593    n_counts_day_1 = n_counts_day_1[0,:]
 594    
 595
 596    #-------------------------------Creation of the data set-------------------------------------
 597    
 598    obs=np.logical_or(n_counts_day_0>0, n_counts_day_1>0)
 599    n1_samples=n_counts_day_0[obs]
 600    n2_samples=n_counts_day_1[obs]
 601    pair_samples_df= pd.DataFrame({'Clone_count_1':n1_samples,'Clone_count_2':n2_samples})
 602    
 603    pair_samples_df['Clone_frequency_1'] = pair_samples_df['Clone_count_1'] / np.sum(pair_samples_df['Clone_count_1'])
 604    pair_samples_df['Clone_frequency_2'] = pair_samples_df['Clone_count_2'] / np.sum(pair_samples_df['Clone_count_2'])
 605    
 606    
 607    return pair_samples_df
 608
 609def _experimental_sampling_diffusion_NegBin(NreadsI, NreadsII, paras, x_0, x_2, N_cell_0, N_cell_2):
 610    
 611    
 612    #----------------------------Counts generation --------------------------------------------
 613    
 614    ##Initial condition
 615    N_total_0 = len(x_0[x_0 != -np.inf])
 616    x_0_bis = x_0[x_0 != -np.inf]
 617    
 618    print('Number of clones at initial time ' + str(N_total_0))
 619    
 620    N_total_2 = len(x_2[x_2 != -np.inf])
 621    x_2_bis = x_2[x_2 != -np.inf]
 622    
    print('Number of clones at final time ' + str(N_total_2))
 624    
 625    #N_total = min(N_total_0, N_total_2)
 626    assert len(x_0) == len(x_2)
 627    N_total = len(x_0)
 628    
 629
 630    f_vec_initial = np.exp(x_0)/N_cell_0
 631    m=float(NreadsI)*f_vec_initial
 632    print(m)
 633 
 634    beta_mv=paras[1]
 635    alpha_mv=paras[2]
 636   
 637    v=m+beta_mv*np.power(m,alpha_mv)
 638
 639    pvec=1-m/v
 640    nvec=m*m/v/pvec
 641
 642    pvec = np.nan_to_num(pvec, nan=0.0)
 643    nvec = np.nan_to_num(nvec, nan=1e-30)
 644
 645    print(pvec)
 646    print(1-pvec)
 647    print(np.sum(pvec>=1))
 648    print(nvec)
 649
 650    n_counts_day_0 = np.random.negative_binomial(nvec, 1-pvec, size =(1, int(N_total)))
 651    n_counts_day_0 = n_counts_day_0[0,:]
 652    print(n_counts_day_0)
 653    
 654    
 655    #Final condition
 656    f_vec_end = np.exp(x_2)/N_cell_2
 657    m_end=float(NreadsII)*f_vec_end
 658    print(m_end)
 659
 660    v_end=m_end+beta_mv*np.power(m_end,alpha_mv)
 661    pvec_end=1-m_end/v_end
 662    nvec_end=m_end*m_end/v_end/pvec_end
 663
 664    pvec_end = np.nan_to_num(pvec_end, nan=0.0)
 665    nvec_end = np.nan_to_num(nvec_end, nan=1e-30)
 666
 667
 668    n_counts_day_1 = np.random.negative_binomial(nvec_end, 1-pvec_end, size =(1, int(N_total)))
 669    n_counts_day_1 = n_counts_day_1[0,:]
 670    print(n_counts_day_1)
 671
 672
 673    #-------------------------------Creation of the data set-------------------------------------
 674    
 675    obs=np.logical_or(n_counts_day_0>0, n_counts_day_1>0)
 676    n1_samples=n_counts_day_0[obs]
 677    n2_samples=n_counts_day_1[obs]
 678    pair_samples_df= pd.DataFrame({'Clone_count_1':n1_samples,'Clone_count_2':n2_samples})
 679    
 680    pair_samples_df['Clone_frequency_1'] = pair_samples_df['Clone_count_1'] / np.sum(pair_samples_df['Clone_count_1'])
 681    pair_samples_df['Clone_frequency_2'] = pair_samples_df['Clone_count_2'] / np.sum(pair_samples_df['Clone_count_2'])
 682    
 683    
 684    return pair_samples_df
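
# The negative binomial sampling above is parametrized by a mean m and a variance
# v = m + beta*m**alpha, which have to be converted to the (n, p) arguments expected by numpy.
# The sketch below shows that conversion and checks it empirically; the helper name
# negbin_params and the numerical values are assumptions made for this example.

```python
import numpy as np

def negbin_params(m, beta, alpha):
    """Convert mean m and variance v = m + beta*m**alpha to numpy's (n, p)."""
    v = m + beta * m ** alpha
    p = m / v              # success probability (1 - pvec in the code above)
    n = m ** 2 / (v - m)   # number of successes (nvec in the code above)
    return n, p, v

rng = np.random.default_rng(1)
m = 20.0
n, p, v = negbin_params(m, beta=0.5, alpha=1.8)
counts = rng.negative_binomial(n, p, size=200000)
sample_mean, sample_var = counts.mean(), counts.var()
```

# sample_mean and sample_var should be close to m and v, since a negative binomial with
# parameters (n, p) has mean n*(1-p)/p and variance n*(1-p)/p**2.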
 685
 686#==========================================================================================================================
 687
 688
 689#===============================Longitudinal-Data-Pre-Processing===================================
 690
 691class longitudinal_analysis():
 692
    """
    This class provides some tools to inspect and compute some simple statistics on longitudinal data associated with
    one individual (it is independent of the NoisET software).
    
    ...
    Attributes
    ----------
    clone_count_label : str
        label in the clonotype tables indicating the clonotype count
    seq_label : str
        label in the clonotype tables indicating the sequence of the receptor
    clones : dict of pandas.DataFrame
        dictionary containing the clonotype tables as pandas frames. The keys are 
        strings "patient_time"; replicates are merged. Created in the initialization
    times : numpy.ndarray of int
        ordered times of the imported tables. Created in the initialization
    unique_clones : list of str
        list of all the unique clonotype sequences in all the time points
    time_occurrence : list of int
        number of time points in which each clonotype appears. The index
        refers to the clonotype in the unique_clones list
    Methods
    -------
    compute_clone_time_occurrence()
        It creates two new attributes: the list of unique clonotypes in the whole dataset 
        "self.unique_clones" and the time occurrence of each of them "self.time_occurrence".
        The time occurrence is the number of time points in which the clone appears.
    plot_hist_persistence(figsize=(12,10))
        It plots the distribution of time occurrence of the unique clonotypes
    top_clones_set(n_top_clones)
        Compute the set of top clones as the union of the "n_top_clones" most abundant
        clonotypes in each time point
    build_traj_frame(clone_set)
        Build a dataframe containing the frequency at all the time points for each
        of the clonotypes in "clone_set", together with their cumulative frequency
    plot_trajectories(n_top_clones, colormap='viridis', figsize=(12,10))
        Function to plot the trajectories of the first "n_top_clones". Colors of the
        trajectories represent the cumulative frequency in all the time points.
    PCA_traj(n_top_clones, nclus=4)
        Perform PCA over the normalized trajectories of n_top_clones TCR clones.
        The normalization consists in dividing the whole trajectory by its maximum value.
        After PCA the trajectories are clustered in the space of the two principal components
        with a hierarchical clustering algorithm.
    plot_PCA2(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10))
        Plotting the trajectories in the space of their two principal components and
        clustering them as in "PCA_traj".
    plot_PCA_clusters_traj(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10))
        Plotting the trajectories grouped by PCA clusters
    """
 742
 743
 744
 745
    def __init__(self, patient, data_folder, sequence_label='N. Seq. CDR3', clone_count_label='Clone count',
                 replicate1_label='_F1', replicate2_label='_F2', separator='\t'):
        """ 
        Import all the clonotypes of a given patient and store them in the dictionary "self.clones".
        It also creates the list of times "self.times". During this process the replicates at the
        same time points are merged together.
        The names of the tables containing TCR should be structured as "patient_time_replicate.csv".
        Those tables should be csv files compressed in a zip archive (see the example notebook).
        Parameters
        ----------
        patient : str
            The ID of the patient
        data_folder : str
            folder name containing the csv files listing the T-cell receptors
        sequence_label : str
            label in the clonotype tables indicating the sequence of the receptor
        clone_count_label : str
            label in the clonotype tables indicating the clonotype count
        replicate1_label : str
            suffix identifying the first replicate in the file names
        replicate2_label : str
            suffix identifying the second replicate in the file names
        separator : str
            separator symbol in the csv tables
        """
        
        self.clone_count_label = clone_count_label
        self.seq_label = sequence_label
        self.unique_clones = None
        self.time_occurrence = None
        self.times = []
        clones_repl = dict()

        # Iteration over all the files in the folder for importing each table
        for file_name in os.listdir(data_folder):
            # If the name before the underscore corresponds to the chosen patient..
            if file_name.split('_')[0] == patient:
                # Import the table, using the separator passed by the caller
                frame = pd.read_csv(os.path.join(data_folder, file_name), sep=separator, compression=dict(method='zip'))
                # Store it in a dictionary where the key contains the patient, the time
                # and the replicate (the last 10 characters of the file name are dropped).
                clones_repl[file_name[:-10]] = frame
                # Reading the time from the name and storing it
                self.times.append(int(file_name.split('_')[1]))
                print('Clonotypes',file_name[:-10],'imported')

        # Sorting the unique times
        self.times = np.sort(list(set(self.times)))
        self.clones = self._merge_replicates(patient, clones_repl, replicate1_label, replicate2_label)
 787        
 788
 789    def _merge_replicates(self, patient, clones_repl, repl1_label, repl2_label):
 790        
 791        clones_merged = dict()
 792
 793        # Iteration over the times
        for it, t in enumerate(self.times):
            # Building the ids corresponding to the 1st and 2nd replicate at the given time point
            id_F1 = patient + '_' + str(t) + repl1_label
            id_F2 = patient + '_' + str(t) + repl2_label
            # Below all the rows of one table are appended to the rows of the other.
            # pd.concat is used rather than an outer merge, which would collapse rows that are
            # identical in both replicates and so count them only once.
            merged_replicates = pd.concat([clones_repl[id_F1], clones_repl[id_F2]], ignore_index=True)
            # But there are common clonotypes that now appear in two different rows 
            # (one for the first and one for the second replicate)! 
            # Below we collapse those common sequences and the counts of the two are summed 
            merged_replicates = merged_replicates.groupby(self.seq_label, as_index=False).agg({self.clone_count_label: 'sum'})
 804            depth = merged_replicates[self.clone_count_label].sum()
 805            merged_replicates['Clone freq'] = merged_replicates[self.clone_count_label] / depth
 806            merged_replicates = merged_replicates.sort_values('Clone freq', ascending=False)
 807            # The merged table is then added to the dictionary
 808            clones_merged[patient + '_' + str(t)] = merged_replicates
 809
 810        return clones_merged
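
# The replicate merging can be illustrated on two toy tables. This is a minimal sketch under
# stated assumptions: the sequences and counts are invented, and the column names are simply
# the default labels of this class. Stacking the tables with pd.concat keeps both rows of a
# shared clonotype even when they are identical, so that their counts are summed by the groupby.

```python
import pandas as pd

# Two hypothetical replicates sharing the clonotype 'AAA' with the same count
rep1 = pd.DataFrame({'N. Seq. CDR3': ['AAA', 'CCC'], 'Clone count': [5, 2]})
rep2 = pd.DataFrame({'N. Seq. CDR3': ['AAA', 'GGG'], 'Clone count': [5, 1]})

# Stack the rows, then collapse rows with the same sequence by summing their counts
merged = (pd.concat([rep1, rep2], ignore_index=True)
            .groupby('N. Seq. CDR3', as_index=False)
            .agg({'Clone count': 'sum'}))

# Frequencies are computed relative to the total depth of the merged sample
merged['Clone freq'] = merged['Clone count'] / merged['Clone count'].sum()
merged = merged.sort_values('Clone freq', ascending=False)
```

# In this toy case 'AAA' ends up with a count of 10 out of a total depth of 13.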
 811
 812    
 813    def compute_clone_time_occurrence(self):
 814
        """
        It creates two new attributes: the list of unique clonotypes in the whole dataset 
        "self.unique_clones" and the time occurrence of each of them "self.time_occurrence".
        The time occurrence is the number of time points in which the clone appears.
        """
 820
 821        all_clones = np.array([])
 822        for id_, cl in self.clones.items():
 823            all_clones = np.append(all_clones, cl[self.seq_label].values)
 824
 825        # The following function returns the list of unique clonotypes and the number of
 826        # repetitions for each of them. 
 827        # Note that the number of repetitions is exactly the time occurrence
 828        self.unique_clones, self.time_occurrence = np.unique(all_clones, return_counts=True)
 829
 830
 831    def plot_hist_persistence(self, figsize=(12,10)):
 832
        """
        It plots the distribution of time occurrence of the unique clonotypes
        Parameters
        ----------
        figsize : tuple
            width, height in inches
        
        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes.Axes
            axes where the plot is drawn
        """

        # unique_clones stays None until compute_clone_time_occurrence() has been called
        if self.unique_clones is None:
            self.compute_clone_time_occurrence()
 850            
 851        fig, ax = plt.subplots(figsize=figsize)
 852
 853        plt.rc('xtick', labelsize = 30)
 854        plt.rc('ytick', labelsize = 30)
 855
 856        ax.set_yscale('log')
 857        ax.set_xlabel('Time occurrence', fontsize = 30)
 858        ax.set_ylabel('Counts', fontsize = 30)
 859        ax.hist(self.time_occurrence, bins=np.arange(1,len(self.times)+2)-0.5, rwidth=0.6)
 860        
 861        return fig, ax
 862        
 863
 864    def top_clones_set(self, n_top_clones):
 865        
 866        """ 
 867        Compute the set of top clones as the union of the "n_top_clones" most abundant
 868        clonotype in each time point
 869        Parameters
 870        ----------
 871        n_top_clones : int
            number of most abundant clonotypes in each time point
 873        Returns
 874        -------
 875        top_clones : set of str
 876            set of top clones
 877        """
 878
 879        top_clones = set()
 880        for id_, cl in self.clones.items():
 881            top_clones_at_time = cl.sort_values(self.clone_count_label, ascending=False)[:n_top_clones]
 882            top_clones = top_clones.union(top_clones_at_time[self.seq_label].values)
 883        return top_clones
 884    
 885
    def build_traj_frame(self, clone_set):
        
        """ 
        This builds a dataframe containing the frequency at all the time points for each 
        of the clonotypes specified in clone_set.
        The dataframe also has a field that contains the cumulative frequency.
        Parameters
        ----------
        clone_set : iterable of str
            clonotypes whose temporal trajectory is drawn
        Returns
        -------
        traj_frame : pandas.DataFrame
            dataframe containing the frequency at all the time points
        """

        # A set is used for the intersections below, and a list wherever pandas needs an
        # ordered collection (recent pandas versions reject sets as indexers)
        clone_set = set(clone_set)
        traj_frame = pd.DataFrame(index=list(clone_set))
        traj_frame['Clone cumul freq'] = 0.0

        for id_, cl in self.clones.items(): 

            # Getting the time from the key of the clones dictionary
            t = id_.split('_')[1]
            # Selecting the clonotypes that are both in the frame at the given time 
            # point and in clone_set
            top_clones_at_time = clone_set.intersection(set(cl[self.seq_label]))
            # Creating a sub-dataframe containing only the clones in top_clones_at_time
            clones_at_time = cl.set_index(self.seq_label).loc[list(top_clones_at_time)]
            # Creating a new column in the trajectory frame for the frequencies at that time
            traj_frame['t'+str(t)] = traj_frame.index.map(clones_at_time['Clone freq'].to_dict())
            # The clonotypes not present at that time are NaN. Below we convert NaN to 0
            traj_frame = traj_frame.fillna(0)
            # The cumulative frequency for each clonotype is updated
            traj_frame['Clone cumul freq'] += traj_frame['t'+str(t)]
        
        return traj_frame 
 922
 923
 924
 925    # Plot clonal trajectories
 926
 927
    def plot_trajectories(self, n_top_clones, colormap='viridis', figsize=(12,10)):

        """
        Function to plot the trajectories of the first "n_top_clones". Colors of the
        trajectories represent the cumulative frequency in all the time points.
        
        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes in each time point
        colormap  : str
            colors of the trajectories
            
        figsize : tuple
            width, height in inches
        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes.Axes
            axes where the plot is drawn
        """

        # plt.get_cmap is used because matplotlib.cm.get_cmap was deprecated and later removed
        cmap = plt.get_cmap(colormap)
        top_clones = self.top_clones_set(n_top_clones)
        traj_frame = self.build_traj_frame(top_clones)
        
        fig, ax = plt.subplots(figsize=figsize)
        plt.rc('xtick', labelsize = 30)
        plt.rc('ytick', labelsize = 30)
        ax.set_yscale('log')
        ax.set_xlabel('time', fontsize = 25)
        ax.set_ylabel('frequency', fontsize = 25)

        log_counts = np.log10(traj_frame['Clone cumul freq'].values)
        max_log_count = max(log_counts)
        min_log_count = min(log_counts)

        for id_, row in traj_frame.iterrows():
            traj = row.drop(['Clone cumul freq']).to_numpy()
            log_count = np.log10(row['Clone cumul freq'])
            # Rescale the log cumulative frequency to [0, 1] to pick a color from the colormap
            norm_log_count = (log_count-min_log_count)/(max_log_count-min_log_count)
            ax.plot(self.times, traj, c=cmap(norm_log_count))


        sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=min_log_count, vmax=max_log_count))
        # The axes are passed explicitly so that the colorbar knows where to take its space from
        cb = fig.colorbar(sm, ax=ax)
        cb.set_label('Log10 cumulative frequency', fontsize = 25)

        return fig, ax
 978    
 979
 980    def PCA_traj(self, n_top_clones, nclus=4):
 981
        """
        Perform PCA over the normalized trajectories of n_top_clones TCR clones.
        The normalization consists in dividing the whole trajectory by its maximum value.
        After PCA the trajectories are clustered in the space of the two principal components
        with a hierarchical clustering algorithm.
        
        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes in each time point to consider in the PCA
        nclus : int
            number of clusters 
        
        Returns
        -------
        pca : sklearn.decomposition._pca.PCA
            object containing the result of the principal component analysis
            
        clustering : sklearn.cluster._agglomerative.AgglomerativeClustering
            object containing the result of the hierarchical clustering
        """
1003
1004        #Getting the top n_top_clones clonotypes at each time point
1005        top_clones = self.top_clones_set(n_top_clones)
1006        #Building a trajectory dataframe
1007        traj_frame = self.build_traj_frame(top_clones)
1008
1009        #Converting it in a numpy matrix
1010        traj_matrix = traj_frame.drop(['Clone cumul freq'], axis = 1).to_numpy()
1011
1012        # Normalize each trajectory by its maximum
1013        norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis]
1014
1015        pca = PCA(n_components =2).fit(norm_traj_matrix.T)
1016        clustering = AgglomerativeClustering(n_clusters = nclus)
1017        clustering = clustering.fit(pca.components_.T)
1018
1019        return pca, clustering
1020
1021
1022    def plot_PCA2(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):
1023
        """
        Plot the trajectories in the space of their two principal components,
        colored by the clusters computed in "PCA_traj".

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point to consider in the PCA
        nclus : int
            number of clusters
        colormap : str
            colormap indicating the different clusters
        figsize : tuple
            width, height in inches

        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes._subplots.AxesSubplot
            axes on which the plot is drawn
        """
1045
1046
        cmap = cm.get_cmap(colormap)
        pca, clustering = self.PCA_traj(n_top_clones, nclus)

        fig, ax = plt.subplots(figsize=figsize)
        # PCA is fitted on the transposed matrix, so each feature is one trajectory;
        # n_features_in_ replaces the deprecated n_features_ attribute
        ax.set_title('PCA components (%i trajs)' %pca.n_features_in_, fontsize = 25)
        ax.set_xlabel('First component (expl var: %3.2f)'%pca.explained_variance_ratio_[0], fontsize = 25)
        ax.set_ylabel('Second component (expl var: %3.2f)'%pca.explained_variance_ratio_[1], fontsize = 25)
        for c_ind in range(clustering.n_clusters):
            x = pca.components_[0][clustering.labels_ == c_ind]
            y = pca.components_[1][clustering.labels_ == c_ind]
            ax.scatter(x, y, alpha=0.2, color=cmap(c_ind/clustering.n_clusters))

        return fig, ax
1060    
1061
1062    def plot_PCA_clusters_traj(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):
1063
        """
        Plot the normalized trajectories grouped by PCA cluster. For each cluster, the top
        panel shows the individual trajectories and the bottom panel their mean, with the
        standard deviation as error bars.

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point to consider in the PCA
        nclus : int
            number of clusters
        colormap : str
            colormap indicating the different clusters
        figsize : tuple
            width, height in inches (currently unused: the figure size is set to (5*nclus, 12))

        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        axs : numpy array of matplotlib.axes._subplots.AxesSubplot, shape (2, nclus)
            axes on which the plots are drawn
        """
1084
1085        cmap = cm.get_cmap(colormap)
1086        pca, clustering = self.PCA_traj(n_top_clones, nclus)
1087
1088        n_cl = clustering.n_clusters
1089
1090        #Getting the top n_top_clones clonotypes at each time point
1091        top_clones = self.top_clones_set(n_top_clones)
1092        #Building a trajectory dataframe
1093        traj_frame = self.build_traj_frame(top_clones)
1094
1095        #Converting it in a numpy matrix
1096        traj_matrix = traj_frame.drop(['Clone cumul freq'], axis=1).to_numpy()
1097
1098        # Normalize each trajectory by its maximum
1099        norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis]
1100
1101        fig, axs = plt.subplots(2, n_cl, figsize=(5*n_cl, 12))
1102        for cl in range(n_cl):
1103            trajs = norm_traj_matrix[clustering.labels_ == cl]
1104            axs[0][cl].set_xlabel('Time', fontsize = 15)
1105            axs[0][cl].set_ylabel('Normalized frequency', fontsize = 15)
1106            axs[1][cl].set_xlabel('Time', fontsize = 15)
1107            axs[1][cl].set_ylabel('Normalized frequency', fontsize = 15)
1108            for traj in trajs:
1109                axs[0][cl].plot(self.times, traj, alpha=0.2, color=cmap(cl/n_cl))
1110            axs[1][cl].set_ylim(0,1)
1111            axs[1][cl].errorbar(self.times, np.mean(trajs, axis=0), 
1112                                yerr=np.std(trajs, axis=0), lw=3, color=cmap(cl/n_cl))
1113            #axs[1][cl].fill_between(times, np.quantile(trajs, 0.75, axis=0), np.quantile(trajs, 0.25, axis=0), color=colors[cl])
1114               
1115        plt.tight_layout()
1116        return fig, axs
1117
1118#===============================Data-Pre-Processing===================================
1119
1120class Data_Process():
1121
    """
    A class used to represent the longitudinal RepSeq data of one individual and to
    pre-process them.
    ...
    Attributes
    ----------
    path : str
        path to the directory containing the data files used in the analysis
    filename1 : str
        name of the file of the first RepSeq sample: the first replicate when learning the experimental noise,
        or the first time point when detecting clones responding to a stimulus between two time points.
    filename2 : str
        name of the file of the second RepSeq sample: the second replicate when learning the experimental noise,
        or the second time point when detecting clones responding to a stimulus between two time points.
    colnames1 : list of str
        column names to read from the first sample, in the order: clone fraction, clone count,
        nucleotide CDR3, amino-acid CDR3
    colnames2 : list of str
        column names to read from the second sample, in the same order
    Methods
    -------
    import_data() :
        import and merge two RepSeq samples into a single data-frame with the frequencies and abundances
        of all TCR clones present in the union of both samples.

    """
1147
1148    def __init__(self, path, filename1, filename2, colnames1,  colnames2):
1149
1150        self.path = path
1151        self.filename1 = filename1
1152        self.filename2 = filename2
1153        self.colnames1 = colnames1
1154        self.colnames2 = colnames2
1155    
1156
1157    def import_data(self):
        """
        Import and merge two RepSeq samples into a single data-frame with the frequencies and abundances
        of all TCR clones present in the union of both samples.

        Parameters
        ----------
        NONE
        Returns
        -------
        number_clones
            int, number of clones in the union of the two RepSeq samples
        df
            pandas data-frame containing, for both RepSeq samples, the information
            selected by colnames1 and colnames2.
        """
1172
        mincount = 0
        maxcount = np.inf

        headerline = 0  # row index of the header line
        newnames = ['Clone_fraction', 'Clone_count', 'ntCDR3', 'AACDR3']
        suffixes = ('_1', '_2')

        def read_sample(filename, colnames):
            # read one tab-separated sample; gzip is requested explicitly for '.gz' files
            compression = 'gzip' if filename[-2:] == 'gz' else 'infer'
            frame = pd.read_csv(self.path + filename, delimiter='\t', usecols=colnames,
                                header=headerline, compression=compression)[colnames]
            frame.columns = newnames
            return frame

        F1Frame_chunk = read_sample(self.filename1, self.colnames1)
        F2Frame_chunk = read_sample(self.filename2, self.colnames2)
        mergedFrame = pd.merge(F1Frame_chunk, F2Frame_chunk, on=newnames[2], suffixes=suffixes, how='outer')

        # clones absent from one sample get zero frequency and zero count in that sample;
        # the results are assigned back, since fillna/astype on a .loc slice do not modify the frame
        for suffix in suffixes:
            frac_col = newnames[0] + suffix
            count_col = newnames[1] + suffix
            mergedFrame[frac_col] = mergedFrame[frac_col].fillna(0)
            mergedFrame[count_col] = mergedFrame[count_col].fillna(0).astype(int)

        # keep a single amino-acid column: take sample 1, fall back on sample 2 when missing
        aa_1, aa_2 = newnames[3] + suffixes[0], newnames[3] + suffixes[1]
        mergedFrame[aa_1] = mergedFrame[aa_1].fillna(mergedFrame[aa_2])
        mergedFrame = mergedFrame.drop(columns=aa_2).rename(columns={aa_1: newnames[3]})
        mergedFrame = mergedFrame[[newname + suffix for newname in newnames[:2] for suffix in suffixes] + [newnames[2], newnames[3]]]
        filterout = ((mergedFrame.Clone_count_1 < mincount) & (mergedFrame.Clone_count_2 == 0)) | ((mergedFrame.Clone_count_2 < mincount) & (mergedFrame.Clone_count_1 == 0))  # has effect only if mincount>0
        number_clones = len(mergedFrame)
        return number_clones, mergedFrame.loc[((mergedFrame.Clone_count_1 <= maxcount) & (mergedFrame.Clone_count_2 <= maxcount)) & ~filterout]
1211
1212        
1213            
1214
1215#===============================Noise-Model===================================================================
1216### Noise Model
1217
1218class Noise_Model():
1219
    """
    A class providing the methods used to learn the experimental noise from same-day
    biological replicate RepSeq samples.
    ...
    Methods
    -------
    get_sparserep(df) :
        get a sparse representation of the abundances / frequencies of the TCR clones present in both RepSeq samples of interest.
        This reshapes the input data to speed up the algorithm.
    learn_null_model(df, noise_model, init_paras,  output_dir = None, filename = None, display_loss_function = False) :
        optimize the likelihood of the experimental noise model and return the learned parameters.
    diversity_estimate(df, paras, noise_model) :
        estimate the repertoire diversity from the learned noise model.
    """
1234
1235
1236    def get_sparserep(self, df): 
        """
        Transforms {(n1,n2)} data stored in a pandas dataframe to a sparse 1D representation.
        unicountvals_1(2) are the unique values of n1(2).
        sparse_rep_counts gives the number of clones for each unique pair.
        indn1(2) is the index into unicountvals_1(2) giving the value of n1(2) in that unique pair.
        len(indn1)=len(indn2)=len(sparse_rep_counts)
        Parameters
        ----------
        df : pandas data frame
            output of the method .import_data() of a Data_Process instance.
            It should list the TCR clones present in two replicate RepSeq samples,
            with their clone frequencies and clone abundances in the first and second replicate.
        Returns
        -------
        indn1
            numpy array, for each unique pair, index of its n1 value in unicountvals_1
        indn2
            numpy array, for each unique pair, index of its n2 value in unicountvals_2
        sparse_rep_counts
            numpy array, number of clones having the read count pair (n1,n2)
        unicountvals_1
            numpy array of the unique count values present in the first sample, df['Clone_count_1']
        unicountvals_2
            numpy array of the unique count values present in the second sample, df['Clone_count_2']
        NreadsI
            total number of counts/reads in the first sample, referred to in df by "_1"
        NreadsII
            total number of counts/reads in the second sample, referred to in df by "_2"
        """

        # copy so that adding a column does not write into a view of df
        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']].copy()
        counts['paircount'] = 1  # gives a weight of 1 to each observed clone
1269
1270        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
1271        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
1272        clonecountpair_vals = clone_counts.index.values
1273        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
1274        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
1275        NreadsI = np.sum(counts['Clone_count_1'])
1276        NreadsII = np.sum(counts['Clone_count_2'])
1277
1278        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
1279        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)
1280
1281        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII
1282
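# Illustration (not part of the package): a minimal pure-Python sketch of the sparse
# representation described in get_sparserep. The helper name sparse_rep and the toy
# pairs are hypothetical; the method itself relies on pandas groupby and np.unique.

```python
from collections import Counter


def sparse_rep(pairs):
    """Toy sparse representation of (n1, n2) read-count pairs (hypothetical helper)."""
    pair_counts = Counter(pairs)                  # multiplicity of each distinct pair
    unique_pairs = sorted(pair_counts)            # sorted by n1, then n2
    sparse_rep_counts = [pair_counts[p] for p in unique_pairs]
    unicountvals_1 = sorted({n1 for n1, _ in unique_pairs})
    unicountvals_2 = sorted({n2 for _, n2 in unique_pairs})
    # position of each pair's n1 (n2) inside the unique values of sample 1 (2)
    indn1 = [unicountvals_1.index(n1) for n1, _ in unique_pairs]
    indn2 = [unicountvals_2.index(n2) for _, n2 in unique_pairs]
    return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2


pairs = [(0, 1), (1, 0), (0, 1), (5, 7), (1, 0), (0, 1)]
indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2 = sparse_rep(pairs)
```

# Each unique pair is stored once with its multiplicity, so later sums over clones
# become sums over len(sparse_rep_counts) terms instead of one term per clone.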
1283
1284
1285    def _NegBinPar(self,m,v,mvec): 
        '''
        Same as _NegBinParMtr, but for scalar m and v.
        Assumes 0<m<v.
        Output is (len(mvec),) array
        '''
        mmax=mvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(mmax+1,dtype=float)
        # log of the ratio P(n)/P(n-1)=(n+r-1)/n*p for n>=1; the cumulative sum below turns these into log P(n)
        NBvec[1:]=np.log((NBvec[1:]+r-1)/NBvec[1:]*p)
        NBvec[0]=r*math.log(m/v) # log P(0)=r*log(1-p)
        NBvec=np.exp(np.cumsum(NBvec)[mvec]) # keep only the requested counts
        return NBvec
1299
1300    def _NegBinParMtr(self,m,v,nvec): #speed up only insofar as the log and exp are called once on array instead of multiple times on rows
1301        ''' 
1302        computes NegBin probabilities over the ordered (but possibly discontiguous) vector (nvec) 
1303        for mean/variance combinations given by the mean (m) and variance (v) vectors. 
1304        Note that m<v for negative binomial.
1305        Output is (len(m),len(nvec)) array
1306        '''
1307        nmax=nvec[-1]
1308        p = 1-m/v
1309        r = m*m/v/p
1310        NBvec=np.arange(nmax+1,dtype=float)
1311        NBvec=np.log((NBvec+r[:,np.newaxis]-1)*(p[:,np.newaxis]/NBvec))
1312        NBvec[:,0]=r*np.log(m/v) #handle NBvec[0]=0, treated specially when m[0]=0, see below
1313        NBvec=np.exp(np.cumsum(NBvec,axis=1)) #save a bit here
1314        if m[0]==0:
1315            NBvec[0,:]=0.
1316            NBvec[0,0]=1.
1317        NBvec=NBvec[:,nvec]
1318        return NBvec
1319
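# Illustration (not part of the package): a pure-Python sketch of the mean/variance
# parametrization used by _NegBinPar and _NegBinParMtr, with p = 1 - m/v and
# r = m*m/v/p. The function name negbin_pmf and the values m=3, v=6 are chosen for
# illustration only. It checks numerically that the recurrence gives a normalized
# distribution with mean m and variance v.

```python
import math


def negbin_pmf(m, v, nmax):
    """Negative binomial probabilities P(0..nmax) with mean m and variance v > m."""
    p = 1.0 - m / v
    r = m * m / v / p
    log_terms = [r * math.log(m / v)]             # log P(0) = r * log(1 - p)
    for n in range(1, nmax + 1):
        # log of the ratio P(n) / P(n-1) = (n + r - 1) / n * p
        log_terms.append(math.log((n + r - 1.0) / n * p))
    probs, running = [], 0.0
    for term in log_terms:
        running += term                           # cumulative sum gives log P(n)
        probs.append(math.exp(running))
    return probs


probs = negbin_pmf(m=3.0, v=6.0, nmax=400)
total = sum(probs)
mean = sum(n * q for n, q in enumerate(probs))
var = sum(n * n * q for n, q in enumerate(probs)) - mean ** 2
```

# Working with log ratios and a cumulative sum avoids evaluating large factorials or
# gamma functions directly, which is the same idea as the np.cumsum call above.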
1320    def _PoisPar(self, Mvec,unicountvals):
        '''
        computes Poisson probabilities P(n|m) for the means in Mvec and the counts in unicountvals.
        Output is (len(Mvec),len(unicountvals)) array
        '''
        #assert Mvec[0]==0, "first element needs to be zero"
        nmax=unicountvals[-1]
        nlen=len(unicountvals)
        mlen=len(Mvec)
        Nvec=unicountvals
        logNvec=-np.insert(np.cumsum(np.log(np.arange(1,nmax+1))),0,0.)[unicountvals] # -log(n!), with log(0!)=0
        Nmtr=np.exp(Nvec[np.newaxis,:]*np.log(Mvec)[:,np.newaxis]+logNvec[np.newaxis,:]-Mvec[:,np.newaxis]) # np.log(Mvec) throws a warning when Mvec contains 0, since log(0)=-inf
        if Mvec[0]==0:
            Nmtr[0,:]=np.zeros((nlen,)) # when m=0 only n=0 is possible; this also gets rid of nans from log(0)
            Nmtr[0,0]=1. # the first column is set again below when n=0 is included
        if unicountvals[0]==0: # if n=0 is included, P(0|m)=exp(-m); gets rid of nans from 0*log(0)
            Nmtr[:,0]=np.exp(-Mvec)
        return Nmtr
1334
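# Illustration (not part of the package): a scalar sketch of the Poisson probability
# evaluated in log space, with the m=0 case treated separately as in _PoisPar. The
# function name poisson_pmf is hypothetical, and math.lgamma is swapped in for the
# cumulative sum of logs used above to obtain log(n!).

```python
import math


def poisson_pmf(mean, n):
    """P(n | mean) for a Poisson law, evaluated in log space."""
    if mean == 0:
        return 1.0 if n == 0 else 0.0             # all the mass sits at n = 0
    return math.exp(n * math.log(mean) - math.lgamma(n + 1) - mean)


row = [poisson_pmf(2.5, n) for n in range(60)]
```

# Handling mean == 0 explicitly is what avoids the nan produced by 0 * log(0).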
1335    def _get_rhof(self,alpha_rho, nfbins,fmin,freq_dtype):
        '''
        generates a power-law (exponent alpha_rho) clone frequency distribution over
        nfbins discrete, logarithmically spaced frequencies between fmin and 1, of dtype freq_dtype.
        Outputs the normalized log probability density evaluated at the log frequencies, the log
        frequencies themselves, and the log of the normalization constant.'''
1340        fmax=1e0
1341        logfvec=np.linspace(np.log10(fmin),np.log10(fmax), nfbins)
1342        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()  
1343        logrhovec=logfvec*alpha_rho
1344        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
1345        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
1346        logrhovec-=normconst 
1347        return logrhovec,logfvec, normconst
1348
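# Illustration (not part of the package): a pure-Python sketch of how a power law
# rho(f) ~ f**alpha_rho can be normalized on a log-spaced grid with the trapezoid
# rule, as _get_rhof does with numpy. The function name and the values alpha_rho=-2,
# fmin=1e-4 are assumptions made for the example; unlike _get_rhof, the sketch
# returns the normalization constant itself rather than its logarithm.

```python
import math


def power_law_log_density(alpha_rho, nfbins, fmin):
    """Log density of rho(f) ~ f**alpha_rho on a log-spaced grid from fmin to 1."""
    log_fmin = math.log(fmin)
    logfvec = [log_fmin * (1.0 - i / (nfbins - 1)) for i in range(nfbins)]
    logrhovec = [alpha_rho * lf for lf in logfvec]
    # in the variable x = log f, the integrand of rho(f) df is rho(f) * f
    integ = [math.exp(lr + lf) for lr, lf in zip(logrhovec, logfvec)]
    norm = sum((logfvec[i + 1] - logfvec[i]) / 2.0 * (integ[i] + integ[i + 1])
               for i in range(nfbins - 1))        # trapezoid rule
    lognorm = math.log(norm)
    return [lr - lognorm for lr in logrhovec], logfvec, norm


logrhovec, logfvec, norm = power_law_log_density(alpha_rho=-2.0, nfbins=1200, fmin=1e-4)
exact = (1.0 - 1e-4 ** (-1.0)) / (-1.0)           # integral of f**-2 from 1e-4 to 1
```

# Integrating in log f keeps the grid dense at small frequencies, where most clones
# lie, and the extra factor f in the integrand is the Jacobian of that change of variable.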
1349
1350    def _get_logPn_f(self,unicounts,Nreads,logfvec, noise_model, paras):
1351
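        """
        Compute log P(n|f), the log-probability of observing n reads for a clone of frequency f,
        for every frequency in logfvec and every count in unicounts.
        noise_model selects the model: 0 (negative binomial cell counts followed by Poisson
        read sampling), 1 (negative binomial) or 2 (Poisson).
        Output is (len(logfvec),len(unicounts)) array.
        Internal helper of the noise model likelihood, not meant to be called by the user.
        """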
1355
1356        # Choice of the model:
1357        
1358        if noise_model<1:
1359
1360            m_total=float(np.power(10, paras[3])) 
1361            r_c=Nreads/m_total
1362        if noise_model<2:
1363
1364            beta_mv= paras[1]
1365            alpha_mv=paras[2]
1366            
1367        if noise_model<1: #for models that include cell counts
            #compute parametrized range (mean-nsigma*sqrt(mean),mean+5*nsigma*sqrt(mean)) of m values (number of cells) conditioned on n values (reads) appearing in the data only
1369            nsigma=5.
1370            nmin=300.
1371            #for each n, get actual range of m to compute around n-dependent mean m
1372            m_low =np.zeros((len(unicounts),),dtype=int)
1373            m_high=np.zeros((len(unicounts),),dtype=int)
1374            for nit,n in enumerate(unicounts):
1375                mean_m=n/r_c
1376                dev=nsigma*np.sqrt(mean_m)
1377                m_low[nit] =int(mean_m-  dev) if (mean_m>dev**2) else 0                         
1378                m_high[nit]=int(mean_m+5*dev) if (      n>nmin) else int(10*nmin/r_c)
1379            m_cellmax=np.max(m_high)
1380            #across n, collect all in-range m
1381            mvec_bool=np.zeros((m_cellmax+1,),dtype=bool) #cheap bool
1382            nvec=range(len(unicounts))
1383            for nit in nvec:
1384                mvec_bool[m_low[nit]:m_high[nit]+1]=True  #mask vector
1385            mvec=np.arange(m_cellmax+1)[mvec_bool]                
1386            #transform to in-range index
1387            for nit in nvec:
1388                m_low[nit]=np.where(m_low[nit]==mvec)[0][0]
1389                m_high[nit]=np.where(m_high[nit]==mvec)[0][0]
1390
1391        Pn_f=np.zeros((len(logfvec),len(unicounts)))
1392        if noise_model==0:
1393
1394            mean_m=m_total*np.exp(logfvec)
1395            var_m=mean_m+beta_mv*np.power(mean_m,alpha_mv)
1396            Poisvec = self._PoisPar(mvec*r_c,unicounts)
1397            for f_it in range(len(logfvec)):
1398                NBvec=self._NegBinPar(mean_m[f_it],var_m[f_it],mvec)
1399                for n_it,n in enumerate(unicounts):
1400                    Pn_f[f_it,n_it]=np.dot(NBvec[m_low[n_it]:m_high[n_it]+1],Poisvec[m_low[n_it]:m_high[n_it]+1,n_it]) 
1401        
1402        elif noise_model==1:
1403
1404            mean_n=Nreads*np.exp(logfvec)
1405            var_n=mean_n+beta_mv*np.power(mean_n,alpha_mv)
1406            Pn_f = self._NegBinParMtr(mean_n,var_n,unicounts)
1407        elif noise_model==2:
1408
1409            mean_n=Nreads*np.exp(logfvec)
1410            Pn_f= self._PoisPar(mean_n,unicounts)
        else:
            # fail loudly instead of returning the log of an all-zero array
            raise ValueError('noise_model must be 0, 1 or 2')

        return np.log(Pn_f)
1415
1416    #-----------------------------Null-Model-optimization--------------------------
1417        
1418    def _get_Pn1n2(self, paras, sparse_rep, noise_model):
1419
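        """
        Compute the loss minimized when learning the noise model: the mean negative log-likelihood
        per observed clone of the count pairs (n1,n2), with P(n1,n2) renormalized by 1-P(0,0)
        because clones unseen in both samples are not observed.
        Internal helper, not meant to be called by the user.
        """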
1423
1424        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = sparse_rep
1425            
1426        nfbins = 1200
1427        freq_dtype = float
1428
1429        # Parameters
1430
1431        alpha = paras[0]
1432        fmin = np.power(10,paras[-1])
1433
1434        # 
1435        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)
1436
1437        # 
1438
1439        logfvec_tmp=deepcopy(logfvec)
1440
1441        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
1442        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)
1443
1444        # for the trapezoid integral methods
1445
1446        dlogfby2=np.diff(logfvec)/2
1447
1448        # Compute P(0,0) for the normalization constraint
1449        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
1450        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])
1451
1452        #print("computing P(n1,n2)")
1453        Pn1n2 = np.zeros(len(sparse_rep_counts))  # 1D representation
1454        for it, (ind1, ind2) in enumerate(zip(indn1, indn2)):
1455            integ = np.exp(logPn1_f[:, ind1] + logrhofvec + logPn2_f[:, ind2] + logfvec)
1456            Pn1n2[it] = np.dot(dlogfby2, integ[1:] + integ[:-1])
1457        Pn1n2 /= 1. - Pn0n0  # renormalize
1458        return -np.dot(sparse_rep_counts, np.where(Pn1n2 > 0, np.log(Pn1n2), 0)) / float(np.sum(sparse_rep_counts))
1459
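# Illustration (not part of the package): a toy version of the renormalized likelihood
# computed in _get_Pn1n2. The three-point frequency prior, the Poisson read model, the
# depths and the observed pairs are all made-up values; the method itself integrates
# over a 1200-point frequency grid with the chosen noise model.

```python
import math


def poisson_pmf(mean, n):
    """P(n | mean) for a Poisson law with mean > 0."""
    return math.exp(n * math.log(mean) - math.lgamma(n + 1) - mean)


# toy prior: three clone frequencies with weights summing to 1 (illustrative values)
freqs = [1e-4, 1e-3, 1e-2]
weights = [0.7, 0.2, 0.1]
nreads_1, nreads_2 = 500, 800                     # sequencing depths of the two replicates


def joint(n1, n2):
    """P(n1, n2): both replicates sample the same frequency f independently."""
    return sum(w * poisson_pmf(nreads_1 * f, n1) * poisson_pmf(nreads_2 * f, n2)
               for w, f in zip(weights, freqs))


p00 = joint(0, 0)


def conditional(n1, n2):
    """P(n1, n2 | clone observed), i.e. P(n1, n2) renormalized by 1 - P(0, 0)."""
    return joint(n1, n2) / (1.0 - p00)


observed = {(1, 0): 40, (0, 1): 55, (1, 1): 3, (5, 9): 2}   # pair -> number of clones
n_clones = sum(observed.values())
# mean negative log-likelihood per observed clone
loss = -sum(c * math.log(conditional(n1, n2)) for (n1, n2), c in observed.items()) / n_clones
```

# Dividing by 1 - P(0, 0) makes the probabilities of the observable pairs sum to one,
# which is why P(0, 0) is computed before the loop over pairs in the method above.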
1460    
1461
1462
1463    def _callback(self, paras, nparas, sparse_rep, noise_model):
        '''prints iteration info. Called by scipy.optimize.minimize. Not useful for the user.'''
1465
1466        global curr_iter
1467        #curr_iter = 0
1468        global Loss_function 
1469        print(''.join(['{0:d} ']+['{'+str(it)+':3.6f} ' for it in range(1,len(paras)+1)]).format(*([curr_iter]+list(paras))))
1470        #print ('{' + str(len(paras)+1) + ':3.6f}'.format( [self.get_Pn1n2(paras, sparse_rep, acq_model_type)]))
1471        Loss_function = self._get_Pn1n2(paras, sparse_rep, noise_model)
1472        print(Loss_function)
1473        curr_iter += 1
1474        
1475
1476
1477    # Constraints for the Null-Model, no filtered 
1478    def _nullmodel_constr_fn(self, paras, sparse_rep, noise_model, constr_type):
1479            
1480        '''
1481        returns either or both of the two level-set functions: log<f>-log(1/N), with N=Nclones/(1-P(0,0)) and log(Z_f), with Z_f=N<f>_{n+n'=0} + sum_i^Nclones <f>_{f|n,n'}
1482        not useful for the user
1483        '''
1484
1485        # Choice of the model: 
1486
1487        indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep
1488
1489        #Variables that would be chosen in the future by the user 
1490        nfbins = 1200
1491        freq_dtype = float
1492
1493        alpha = paras[0]  # power law exponent
1494        fmin = np.power(10, paras[-1]) # true minimal frequency 
1495
1496        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)
1497        dlogfby2 = np.diff(logfvec) / 2.  # 1/2 comes from trapezoid integration below
1498
1499        integ = np.exp(logrhofvec + 2 * logfvec)
1500        avgf_ps = np.dot(dlogfby2, integ[:-1] + integ[1:])
1501
1502        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI, logfvec, noise_model, paras)
1503        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII, logfvec, noise_model, paras)
1504
1505        integ = np.exp(logPn1_f[:, 0] + logPn2_f[:, 0] + logrhofvec + logfvec)
1506        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])
1507        logPnng0 = np.log(1 - Pn0n0)
1508        avgf_null_pair = np.exp(logPnng0 - np.log(np.sum(sparse_rep_counts)))
1509
1510        C1 = np.log(avgf_ps) - np.log(avgf_null_pair)
1511
1512        integ = np.exp(logPn1_f[:, 0] + logPn2_f[:, 0] + logrhofvec + 2 * logfvec)
1513        log_avgf_n0n0 = np.log(np.dot(dlogfby2, integ[1:] + integ[:-1]))
1514
1515        integ = np.exp(logPn1_f[:, indn1] + logPn2_f[:, indn2] + logrhofvec[:, np.newaxis] + logfvec[:, np.newaxis])
1516        log_Pn1n2 = np.log(np.sum(dlogfby2[:, np.newaxis] * (integ[1:, :] + integ[:-1, :]), axis=0))
1517        integ = np.exp(np.log(integ) + logfvec[:, np.newaxis])
1518        tmp = deepcopy(log_Pn1n2)
        tmp[tmp == -np.inf] = np.inf  # since subtracted in next line (np.Inf was removed in NumPy 2.0)
1520        avgf_n1n2 = np.exp(np.log(np.sum(dlogfby2[:, np.newaxis] * (integ[1:, :] + integ[:-1, :]), axis=0)) - tmp)
1521        log_sumavgf = np.log(np.dot(sparse_rep_counts, avgf_n1n2))
1522
1523        logNclones = np.log(np.sum(sparse_rep_counts)) - logPnng0
1524        Z = np.exp(logNclones + np.log(Pn0n0) + log_avgf_n0n0) + np.exp(log_sumavgf)
1525
1526        C2 = np.log(Z)
1527
1528        
1529        # print('C1:'+str(C1)+' C2:'+str(C2))
1530        if constr_type == 0:
1531            return C1
1532        elif constr_type == 1:
1533            return C2
1534        else:
1535            return C1, C2
1536
1537
1538        
1539    # Null-Model optimization learning 
1540
1541    def learn_null_model(self, df, noise_model, init_paras,  output_dir = None, filename = None, display_loss_function = False):  # constraint type 1 gives only low error modes, see paper for details.
        """
        Learn the parameters of the experimental noise model from two replicate samples.

        Parameters
        ----------
        df : pandas data frame
            output of the method .import_data() of a Data_Process instance.
            It should list the TCR clones present in two replicate RepSeq samples,
            with their clone frequencies and clone abundances in the first and second replicate.
        noise_model : int
            choice of noise model: 0 (negative binomial + Poisson), 1 (negative binomial) or 2 (Poisson)
        init_paras : numpy array
            initial vector of parameters from which the optimization starts; its length must match
            the chosen model (5, 4 or 2 parameters for noise_model 0, 1 or 2)
        output_dir : str
            name of the output directory in which to save the values of the parameters, default value is None
        filename : str
            name of the output file, without extension; default value is None, in which case the
            file is named 'nullpara' followed by the value of noise_model
        display_loss_function : bool
            whether to print the loss function during the experimental noise learning, default value is
            False. In this method the loss is currently printed at every iteration regardless of this flag.

        Returns
        -------
        outstruct
            scipy.optimize.OptimizeResult, result of the optimization; the learned parameters are in outstruct.x
        constr_value
            float, value of the constraint at the learned parameters

        """
1567            
1568        # Data introduction
1569        sparse_rep = self.get_sparserep(df)
1570        constr_type = 1
1571
1572        # Choice of the model:
1573        # Parameters initialization depending on the model 
1574        if noise_model < 1:
1575            parameter_labels = ['alph_rho', 'beta', 'alpha', 'm_total', 'fmin']
1576        elif noise_model == 1:
1577            parameter_labels = ['alph_rho', 'beta', 'alpha', 'fmin']
1578        else:
1579            parameter_labels = ['alph_rho', 'fmin']
1580
1581        assert len(parameter_labels) == len(init_paras), "number of model and initial paras differ!"
1582
1583        condict = {'type': 'eq', 'fun': self._nullmodel_constr_fn, 'args': (sparse_rep, noise_model, constr_type)}
1584
1585
1586        partialobjfunc = partial(self._get_Pn1n2, sparse_rep=sparse_rep, noise_model=noise_model)
1587        nullfunctol = 1e-6
1588        nullmaxiter = 200
1589        header = ['Iter'] + parameter_labels
1590        print(''.join(['{' + str(it) + ':9s} ' for it in range(len(init_paras) + 1)]).format(*header))
1591            
1592        global curr_iter
1593        curr_iter = 1
1594        callbackp = partial(self._callback, nparas=len(init_paras), sparse_rep = sparse_rep, noise_model= noise_model)
1595        outstruct = minimize(partialobjfunc, init_paras, method='SLSQP', callback=callbackp, constraints=condict,
1596                        options={'ftol': nullfunctol, 'disp': True, 'maxiter': nullmaxiter})
1597            
1598        constr_value = self._nullmodel_constr_fn(outstruct.x, sparse_rep, noise_model, constr_type)
1599
        # tabulate the learned parameters; parameter_labels was set above according to noise_model
        para_frame = pd.DataFrame(data = {'label' : parameter_labels, 'value': outstruct.x})

        # default file name is 'nullpara<noise_model>.txt'; without output_dir the file is
        # written to the working directory (also when only filename is given)
        out_name = ('nullpara' + str(noise_model) if filename is None else filename) + '.txt'
        out_path = out_name if output_dir is None else output_dir + '/' + out_name
        para_frame.to_csv(out_path, sep = '\t')

        return outstruct, constr_value
1624
1625    def diversity_estimate(self, df, paras, noise_model):
1626
        """
        Estimate the diversity of the individual's repertoire from the learned experimental noise model.
        The number of observed clones is divided by 1-P(0,0), the probability for a clone
        to be seen in at least one of the two samples.
        Parameters
        ----------
        df : data-frame
            The data-frame which has been used to learn the noise model
        paras : numpy array
            vector containing the noise parameters
        noise_model : int
            choice of noise model
        Returns
        -------
        diversity_estimate
            int, diversity estimate from the noise model inference.

        """
1643
1644        sparse_rep = self.get_sparserep(df)
1645
1646        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2, NreadsI, NreadsII = sparse_rep
1647            
1648        nfbins = 1200
1649        freq_dtype = float
1650
1651        # Parameters
1652
1653        alpha = paras[0]
1654        fmin = np.power(10,paras[-1])
1655
1656        # 
1657        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)
1658
1659        # 
1660
1661        logfvec_tmp=deepcopy(logfvec)
1662
1663        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
1664        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)
1665
1666        # for the trapezoid integral methods
1667
1668        dlogfby2=np.diff(logfvec)/2
1669
1670        # Compute P(0,0) for the normalization constraint
1671        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
1672        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])
1673
1674        #print(np.sum(sparse_rep_counts))
1675        N_obs = np.sum(sparse_rep_counts)
1676
1677        return int(N_obs/(1-Pn0n0))
1678
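# Illustration (not part of the package): the extrapolation performed at the end of
# diversity_estimate, isolated in a small function. The name estimate_diversity and
# the numbers are hypothetical; in the method, P(0,0) comes from the learned noise model.

```python
def estimate_diversity(n_observed, p00):
    """Extrapolate the total number of clones from the observed ones and P(0, 0)."""
    if not 0.0 <= p00 < 1.0:
        raise ValueError("p00 must lie in [0, 1)")
    # a clone is observed with probability 1 - p00, so n_observed ~ N * (1 - p00)
    return int(n_observed / (1.0 - p00))


# 1000 clones seen in at least one replicate, 75% of clones expected to be missed in both
diversity = estimate_diversity(1000, 0.75)
```

# The closer P(0,0) is to 1, the larger the extrapolation factor, so the estimate
# becomes increasingly sensitive to the learned noise parameters.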
1679
1680#============================================Differential expression =============================================================
1681
1682class Expansion_Model():
1683    
    """
    A class providing the methods used to select significantly expanding or
    contracting clones from RepSeq samples taken at two different time points.
    ...
    Methods
    -------
    get_sparserep(df) :
        get a sparse representation of the abundances / frequencies of the TCR clones present in the RepSeq samples of both time points.
        This reshapes the input data to speed up the algorithm.
    expansion_table(outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):
        generate the table of clones detected as significantly responding to an acute stimulus.
    """
1696

    def get_sparserep(self, df): 
        """
        Transforms {(n1,n2)} data stored in a pandas dataframe to a sparse 1D representation.
        unicountvals_1(2) are the unique values of n1(2).
        sparse_rep_counts gives the counts of unique pairs.
        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
        len(indn1)=len(indn2)=len(sparse_rep_counts)
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            This data-frame should list the TCR clones present in two RepSeq samples, taken at two 
            different time points, with their clone frequencies and clone abundances in the first and second sample.
        Returns
        -------
        indn1
            numpy array of indices into unicountvals_1, one per unique (n1,n2) pair
        indn2
            numpy array of indices into unicountvals_2, one per unique (n1,n2) pair
        sparse_rep_counts
            numpy array, # of clones having the read counts pair {(n1,n2)} 
        unicountvals_1
            numpy array of unique count values present in the first sample in df['Clone_count_1']
        unicountvals_2
            numpy array of unique count values present in the second sample in df['Clone_count_2']
        NreadsI
            total number of counts/reads in the first sample, referred to in df by "_1" for the first time point
        NreadsII
            total number of counts/reads in the second sample, referred to in df by "_2" for the second time point
        """
        
        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']].copy()  # copy so that df itself is not modified
        counts['paircount'] = 1  # gives a weight of 1 to each observed clone

        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
        clonecountpair_vals = clone_counts.index.values
        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
        NreadsI = np.sum(counts['Clone_count_1'])
        NreadsII = np.sum(counts['Clone_count_2'])

        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)

        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII

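To make the sparse representation returned by `get_sparserep` concrete, here is a small standalone sketch on a toy table. It only assumes the column names `Clone_count_1` and `Clone_count_2` used above; the function name `sparse_rep_sketch` and the toy data are hypothetical, and the grouping is written with `groupby(...).size()` rather than the summed weight column used in the method.

```python
import numpy as np
import pandas as pd

def sparse_rep_sketch(df):
    """Hypothetical illustration of the (n1, n2) sparse representation."""
    # one row per unique (n1, n2) pair, with the number of clones sharing it
    pairs = (df.groupby(['Clone_count_1', 'Clone_count_2'])
               .size()
               .reset_index(name='paircount'))
    # unique count values, plus the index of each pair's value in them
    unicountvals_1, indn1 = np.unique(pairs['Clone_count_1'].to_numpy(),
                                      return_inverse=True)
    unicountvals_2, indn2 = np.unique(pairs['Clone_count_2'].to_numpy(),
                                      return_inverse=True)
    return indn1, indn2, pairs['paircount'].to_numpy(), unicountvals_1, unicountvals_2

# five toy clones; two of them share the count pair (1, 0)
toy = pd.DataFrame({'Clone_count_1': [0, 1, 1, 3, 1],
                    'Clone_count_2': [2, 0, 0, 3, 4]})
indn1, indn2, counts, u1, u2 = sparse_rep_sketch(toy)

# the original pairs are recovered as (u1[indn1], u2[indn2]), weighted by counts
```

Likelihoods then only need to be evaluated once per unique count value instead of once per clone, which is where the speed-up comes from.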
    

    def _NegBinPar(self,m,v,mvec): 
        '''
        Same as _NegBinParMtr, but for m and v being scalars.
        Assumes m>0.
        Output is (len(mvec),) array
        '''
        mmax=mvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(mmax+1,dtype=float)   
        NBvec[1:]=np.log((NBvec[1:]+r-1)/NBvec[1:]*p) #vectorization won't help here, unfortunately, since log needs to be over array
        NBvec[0]=r*math.log(m/v)
        NBvec=np.exp(np.cumsum(NBvec)[mvec]) #save a bit here
        return NBvec


    def _NegBinParMtr(self,m,v,nvec): #speed up only insofar as the log and exp are called once on array instead of multiple times on rows
        ''' 
        computes NegBin probabilities over the ordered (but possibly discontiguous) vector (nvec) 
        for mean/variance combinations given by the mean (m) and variance (v) vectors. 
        Note that m<v for negative binomial.
        Output is (len(m),len(nvec)) array
        '''
        nmax=nvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(nmax+1,dtype=float)
        NBvec=np.log((NBvec+r[:,np.newaxis]-1)*(p[:,np.newaxis]/NBvec))
        NBvec[:,0]=r*np.log(m/v) #handle NBvec[0]=0, treated specially when m[0]=0, see below
        NBvec=np.exp(np.cumsum(NBvec,axis=1)) #save a bit here
        if m[0]==0:
            NBvec[0,:]=0.
            NBvec[0,0]=1.
        NBvec=NBvec[:,nvec]
        return NBvec

    def _PoisPar(self, Mvec,unicountvals):
        #assert Mvec[0]==0, "first element needs to be zero"
        nmax=unicountvals[-1]
        nlen=len(unicountvals)
        mlen=len(Mvec)
        Nvec=unicountvals
        logNvec=-np.insert(np.cumsum(np.log(np.arange(1,nmax+1))),0,0.)[unicountvals] #avoid n=0 nans  
        Nmtr=np.exp(Nvec[np.newaxis,:]*np.log(Mvec)[:,np.newaxis]+logNvec[np.newaxis,:]-Mvec[:,np.newaxis]) # np.log(Mvec) throws warning: since log(0)=-inf
        if Mvec[0]==0:
            Nmtr[0,:]=np.zeros((nlen,)) #when m=0, n=0, and so get rid of nans from log(0)
            Nmtr[0,0]=1. #P(n=0|m=0)=1
        if unicountvals[0]==0: #if n=0 included get rid of nans from log(0)
            Nmtr[:,0]=np.exp(-Mvec)
        return Nmtr

    def _get_rhof(self,alpha_rho, nfbins,fmin,freq_dtype):
        '''
        generates power law (power is alpha_rho) clone frequency distribution over 
        nfbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype.
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax), nfbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()  
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst 
        return logrhovec,logfvec

    
    def _get_logPn_f(self,unicounts,Nreads,logfvec, noise_model, paras):

        """
        computes the log-likelihood log P(n|f) of the noise model on the frequency grid logfvec
        (noise_model 0: negative binomial + Poisson, 1: negative binomial, 2: Poisson).
        Internal helper, not intended for direct use.
        """
        
        # Choice of the model:
        
        if noise_model<1:

            m_total=float(np.power(10, paras[3])) 
            r_c=Nreads/m_total
        if noise_model<2:

            beta_mv= paras[1]
            alpha_mv=paras[2]
            
        if noise_model<1: #for models that include cell counts
            #compute parametrized range (mean-sigma,mean+5*sigma) of m values (number of cells) conditioned on n values (reads) appearing in the data only 
            nsigma=5.
            nmin=300.
            #for each n, get actual range of m to compute around n-dependent mean m
            m_low =np.zeros((len(unicounts),),dtype=int)
            m_high=np.zeros((len(unicounts),),dtype=int)
            for nit,n in enumerate(unicounts):
                mean_m=n/r_c
                dev=nsigma*np.sqrt(mean_m)
                m_low[nit] =int(mean_m-  dev) if (mean_m>dev**2) else 0                         
                m_high[nit]=int(mean_m+5*dev) if (      n>nmin) else int(10*nmin/r_c)
            m_cellmax=np.max(m_high)
            #across n, collect all in-range m
            mvec_bool=np.zeros((m_cellmax+1,),dtype=bool) #cheap bool
            nvec=range(len(unicounts))
            for nit in nvec:
                mvec_bool[m_low[nit]:m_high[nit]+1]=True  #mask vector
            mvec=np.arange(m_cellmax+1)[mvec_bool]                
            #transform to in-range index
            for nit in nvec:
                m_low[nit]=np.where(m_low[nit]==mvec)[0][0]
                m_high[nit]=np.where(m_high[nit]==mvec)[0][0]

        Pn_f=np.zeros((len(logfvec),len(unicounts)))
        if noise_model==0:

            mean_m=m_total*np.exp(logfvec)
            var_m=mean_m+beta_mv*np.power(mean_m,alpha_mv)
            Poisvec = self._PoisPar(mvec*r_c,unicounts)
            for f_it in range(len(logfvec)):
                NBvec=self._NegBinPar(mean_m[f_it],var_m[f_it],mvec)
                for n_it,n in enumerate(unicounts):
                    Pn_f[f_it,n_it]=np.dot(NBvec[m_low[n_it]:m_high[n_it]+1],Poisvec[m_low[n_it]:m_high[n_it]+1,n_it]) 
        
        elif noise_model==1:

            mean_n=Nreads*np.exp(logfvec)
            var_n=mean_n+beta_mv*np.power(mean_n,alpha_mv)
            Pn_f = self._NegBinParMtr(mean_n,var_n,unicounts)
        elif noise_model==2:

            mean_n=Nreads*np.exp(logfvec)
            Pn_f= self._PoisPar(mean_n,unicounts)
        else:
            # fail loudly instead of returning log(0) for an unknown model
            raise ValueError('noise_model is 0, 1, or 2 only')

        return np.log(Pn_f)

    def _get_Ps(self, alp,sbar,smax,stp):
        '''
        generates symmetric exponential distribution over log fold change
        with effect size sbar and nonresponding fraction 1-alp at s=0.
        computed over discrete range of s from -smax to smax in steps of size stp
        '''
        lamb=-stp/sbar
        smaxt=round(smax/stp)
        s_zeroind=int(smaxt)
        Z=2*(np.exp((smaxt+1)*lamb)-1)/(np.exp(lamb)-1)-1
        Ps=alp*np.exp(lamb*np.fabs(np.arange(-smaxt,smaxt+1)))/Z
        Ps[s_zeroind]+=(1-alp)
        return Ps

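The normalisation constant `Z` in `_get_Ps` is the closed form of the geometric sum of exp(-stp*|k|/sbar) over the grid indices k = -smaxt, ..., smaxt. The sketch below rebuilds the same prior with an explicit sum, which makes it easy to check that it is symmetric and sums to 1. The function name `symmetric_exponential_prior` is hypothetical and used only for this illustration.

```python
import numpy as np

def symmetric_exponential_prior(alp, sbar, smax, stp):
    """Hypothetical re-derivation of the prior P(s) on a discrete s grid."""
    n_side = int(round(smax / stp))          # grid points on each side of s=0
    k = np.arange(-n_side, n_side + 1)       # integer grid index, s = k * stp
    weights = np.exp(-stp * np.abs(k) / sbar)
    ps = alp * weights / weights.sum()       # responding fraction alp, spread over s
    ps[n_side] += 1.0 - alp                  # non-responding fraction sits at s = 0
    return ps

ps = symmetric_exponential_prior(alp=0.1, sbar=1.0, smax=25.0, stp=0.1)

# closed-form normalisation used in _get_Ps, for comparison with the explicit sum
lamb = -0.1 / 1.0
n_side = 250
Z_closed = 2 * (np.exp((n_side + 1) * lamb) - 1) / (np.exp(lamb) - 1) - 1
Z_summed = np.exp(lamb * np.abs(np.arange(-n_side, n_side + 1))).sum()
```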
    def _callbackFdiffexpr(self, Xi): #case dependent
        '''prints iteration info. Called by scipy.optimize.minimize'''
               
        print('{0: 3.6f}   {1: 3.6f}   '.format(Xi[0], Xi[1])+'\n')   
    

    def _learning_dynamics_expansion_polished(self, df, paras_1, paras_2,  noise_model):
        """
        function to infer the expansion model parameters - not intended for direct use.
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = self.get_sparserep(df)

        alpha_rho = paras_1[0]
        fmin = np.power(10,paras_1[-1])
        freq_dtype = 'float64'
        nfbins = 1200 #Accuracy of the integration


        logrhofvec, logfvec = self._get_rhof(alpha_rho, nfbins, fmin, freq_dtype)

        #Definition of svec
        smax = 25.0     #maximum absolute logfold change value
        s_step = 0.1
        s_0 = -1
        
        s_step_old= s_step
        logf_step= logfvec[1] - logfvec[0] #natural log here, since f2=f1*exp(s)
        f2s_step= int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        s_step= float(f2s_step)*logf_step
        smax= s_step*(smax/s_step_old)
        svec= s_step*np.arange(0,int(round(smax/s_step)+1))   
        svec= np.append(-svec[1:][::-1],svec)

        smaxind=(len(svec)-1)//2 #integer, so that it can be used as a number of grid points below
        f2s_step=int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        logfmin=logfvec[0 ]-f2s_step*smaxind*logf_step
        logfmax=logfvec[-1]+f2s_step*smaxind*logf_step
        
        logfvecwide = np.linspace(logfmin,logfmax,len(logfvec)+2*smaxind*f2s_step) #a wider domain for the second frequency f2=f1*exp(s)
            
        # Compute P(n1|f) and P(n2|f), each in an iteration of the following loop

        for it in range(2):
            if it == 0:
                unicounts=unicountvals_1
                logfvec_tmp=deepcopy(logfvec)
                Nreads = NreadsI
                paras = paras_1
            else:
                unicounts=unicountvals_2
                logfvec_tmp=deepcopy(logfvecwide) #contains s-shift for sampled data method
                Nreads = NreadsII
                paras = paras_2
            if it == 0:
                logPn1_f = self._get_logPn_f( unicounts, Nreads, logfvec_tmp, noise_model, paras)

            else:
                logPn2_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

        #for the trapezoid method
        dlogfby2=np.diff(logfvec)/2 

        # Computing P(n1,n2|f,s)
        Pn1n2_s=np.zeros((len(svec), len(unicountvals_1), len(unicountvals_2))) 

        for s_it,s in enumerate(svec):
            for it,(n1_it, n2_it) in enumerate(zip(indn1,indn2)):
                integ = np.exp(logrhofvec+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),n2_it]+logPn1_f[:,n1_it]+ logfvec )
                Pn1n2_s[s_it, n1_it, n2_it] = np.dot(dlogfby2,integ[1:] + integ[:-1])
            
    
        Pn0n0_s = np.zeros(svec.shape)
        for s_it,s in enumerate(svec):    
            integ=np.exp(logPn1_f[:,0]+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),0]+logrhofvec+logfvec)
            Pn0n0_s[s_it]=np.dot(dlogfby2,integ[1:]+integ[:-1])
            
    
        N_obs = np.sum(sparse_rep_counts)
        print("N_obs: " + str(N_obs))
    
            
        def cost(PARAS):

            alp = PARAS[0]
            sbar = PARAS[1]

            Ps = self._get_Ps(alp,sbar,smax,s_step)
            Pn0n0=np.dot(Pn0n0_s,Ps)
            Pn1n2_ps=np.sum(Pn1n2_s*Ps[:,np.newaxis,np.newaxis],0)
            Pn1n2_ps/=1-Pn0n0
            #print(Pn0n0)

       

            Energy = - np.dot(sparse_rep_counts/float(N_obs),np.where(Pn1n2_ps[indn1,indn2]>0,np.log(Pn1n2_ps[indn1,indn2]),0)) 
                
            return Energy

    #--------------------------Compute-the-grid-----------------------------------------
        
        print('Calculation Surface : \n')
        st = time.time()

        npoints = 20 #to be chosen by the user 
        alpvec = np.logspace(-3,np.log10(0.99), npoints)
        sbarvec = np.linspace(0.01,5, npoints)

        LSurface =np.zeros((len(sbarvec),len(alpvec)))
        for i in range(len(sbarvec)):
            for j in range(len(alpvec)):
                LSurface[i, j]=  - cost([alpvec[j], sbarvec[i]])
        
        alpmesh, sbarmesh = np.meshgrid(alpvec, sbarvec)
        a,b = np.where(LSurface == np.max(LSurface))
        print("--- %s seconds ---" % (time.time() - st))
    
    
    #------------------------------Optimization----------------------------------------------
        
        optA = alpmesh[a[0],b[0]]
        optB = sbarmesh[a[0],b[0]]
                  
        print('polish parameter estimate from '+ str(optA)+' '+str(optB))
        initparas=(optA,optB)  
    

        outstruct = minimize(cost, initparas, method='SLSQP', callback=self._callbackFdiffexpr, tol=1e-6,options={'ftol':1e-8 ,'disp': True,'maxiter':300})

        return outstruct.x, Pn1n2_s, Pn0n0_s, svec

    def _learning_dynamics_expansion(self, sparse_rep, paras_1, paras_2, noise_model, display_plot=False):
        """
        function to infer the expansion model parameters - not intended for direct use.
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = sparse_rep

        alpha_rho = paras_1[0]
        fmin = np.power(10,paras_1[-1])
        freq_dtype = 'float64'
        nfbins = 1200 #Accuracy of the integration


        logrhofvec, logfvec = self._get_rhof(alpha_rho, nfbins, fmin, freq_dtype)

        #Definition of svec
        smax = 25.0     #maximum absolute logfold change value
        s_step = 0.1
        s_0 = -1
        
        s_step_old= s_step
        logf_step= logfvec[1] - logfvec[0] #natural log here, since f2=f1*exp(s)
        f2s_step= int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        s_step= float(f2s_step)*logf_step
        smax= s_step*(smax/s_step_old)
        svec= s_step*np.arange(0,int(round(smax/s_step)+1))   
        svec= np.append(-svec[1:][::-1],svec)

        smaxind=(len(svec)-1)/2
        f2s_step=int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
        logfmin=logfvec[0 ]-f2s_step*smaxind*logf_step
        logfmax=logfvec[-1]+f2s_step*smaxind*logf_step
        
        logfvecwide = np.linspace(logfmin,logfmax,int(len(logfvec)+2*smaxind*f2s_step)) #a wider domain for the second frequency f2=f1*exp(s)
            
        # Compute P(n1|f) and P(n2|f), each in an iteration of the following loop

        for it in range(2):
            if it == 0:
                unicounts=unicountvals_1
                logfvec_tmp=deepcopy(logfvec)
                Nreads = NreadsI
                paras = paras_1
            else:
                unicounts=unicountvals_2
                logfvec_tmp=deepcopy(logfvecwide) #contains s-shift for sampled data method
                Nreads = NreadsII
                paras = paras_2
            if it == 0:
                logPn1_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

            else:
                logPn2_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)

        #for the trapezoid method
        dlogfby2=np.diff(logfvec)/2 

        # Computing P(n1,n2|f,s)
        Pn1n2_s=np.zeros((len(svec), len(unicountvals_1), len(unicountvals_2))) 

        for s_it,s in enumerate(svec):
            for it,(n1_it, n2_it) in enumerate(zip(indn1,indn2)):
                integ = np.exp(logrhofvec+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),n2_it]+logPn1_f[:,n1_it]+ logfvec )
                Pn1n2_s[s_it, n1_it, n2_it] = np.dot(dlogfby2,integ[1:] + integ[:-1])
            
    
        Pn0n0_s = np.zeros(svec.shape)
        for s_it,s in enumerate(svec):    
            integ=np.exp(logPn1_f[:,0]+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),0]+logrhofvec+logfvec)
            Pn0n0_s[s_it]=np.dot(dlogfby2,integ[1:]+integ[:-1])
            
   
        N_obs = np.sum(sparse_rep_counts)
        print("N_obs: " + str(N_obs))
    
            
        def cost(PARAS):

            alp = PARAS[0]
            sbar = PARAS[1]

            Ps = self._get_Ps(alp,sbar,smax,s_step)
            Pn0n0=np.dot(Pn0n0_s,Ps)
            Pn1n2_ps=np.sum(Pn1n2_s*Ps[:,np.newaxis,np.newaxis],0)
            Pn1n2_ps/=1-Pn0n0
            #print(Pn0n0)

       

            Energy = - np.dot(sparse_rep_counts/float(N_obs),np.where(Pn1n2_ps[indn1,indn2]>0,np.log(Pn1n2_ps[indn1,indn2]),0)) 
                
            return Energy

    #--------------------------Compute-the-grid-----------------------------------------
        
        print('Calculation Surface : \n')
        st = time.time()

        npoints = 50 #to be chosen by the user 
        alpvec = np.logspace(-3,np.log10(0.99), npoints)
        sbarvec = np.linspace(0.01,5, npoints)

        LSurface =np.zeros((len(sbarvec),len(alpvec)))
        for i in range(len(sbarvec)):
            for j in range(len(alpvec)):
                LSurface[i, j]=  - cost([alpvec[j], sbarvec[i]])
        
        alpmesh, sbarmesh = np.meshgrid(alpvec, sbarvec)
        a,b = np.where(LSurface == np.max(LSurface))
        print("--- %s seconds ---" % (time.time() - st))
    
    #---------------------------Plot-the-grid-------------------------------------------
        if display_plot:

            fig, ax =plt.subplots(1, figsize=(10,8))

         
            a,b = np.where(LSurface == np.max(LSurface))

            ax.contour(alpmesh, sbarmesh, LSurface, linewidths=1, colors='k', linestyles = 'solid')
            plt.contourf(alpmesh, sbarmesh, LSurface, 20, cmap = 'viridis', alpha= 0.8)

            xmax = alpmesh[a[0],b[0]]
            ymax = sbarmesh[a[0],b[0]]
            text= r"$ \alpha={:.3f}, s={:.3f} $".format(xmax, ymax)
            bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
            arrowprops=dict(arrowstyle="->",connectionstyle="angle,angleA=0,angleB=80")
            kw = dict(xycoords='data',textcoords="axes fraction",
                arrowprops=arrowprops, bbox=bbox_props, ha="right", va="top")
            plt.annotate(text, xy=(xmax, ymax), xytext=(0.94,0.96), **kw)
            plt.xlabel(r'$ \alpha, \ size \ of \ the \ repertoire \ that \ answers \ to \ the \ vaccine $') 
            plt.ylabel(r'$ s_{bar}, \ characteristic \ expansion \ decrease $')
            plt.xscale('log')
            plt.yscale('log')
            plt.grid()
            plt.title(r'$Grid \ Search \ graph \ for \ \alpha \ and \ s_{bar} \ parameters. $')
            plt.colorbar()

        return LSurface, Pn1n2_s, Pn0n0_s, svec
 

    def _save_table(self, outpath, svec, Ps,Pn1n2_s, Pn0n0_s,  subset, unicountvals_1_d, unicountvals_2_d, indn1_d, indn2_d, print_expanded, pthresh, smedthresh):
        '''
        takes the learned diffexpr model, Pn1n2_s*Ps, computes posteriors over (n1,n2) pairs, and writes to file a table of data with clones as rows and measures of their posteriors as columns 
        print_expanded=True orders the table ascending by the null probability 1-P(s>0), else descending
        pthresh is the threshold in 'p-value'-like (null hypo) probability, 1-P(s>0|n1_i,n2_i), where i is the row (i.e. the clone) n.b. lower null prob implies larger probability of expansion
        smedthresh is the threshold on the posterior median, below which clones are discarded
        Internal helper, not intended for direct use. 
        '''

        Psn1n2_ps=Pn1n2_s*Ps[:,np.newaxis,np.newaxis] 
    
        #compute marginal likelihood (neglect renormalization , since it cancels in conditional below) 
        Pn1n2_ps=np.sum(Psn1n2_ps,0)

        Ps_n1n2ps=Pn1n2_s*Ps[:,np.newaxis,np.newaxis]/Pn1n2_ps[np.newaxis,:,:]
        #compute cdf to get p-value to threshold on to reduce output size
        cdfPs_n1n2ps=np.cumsum(Ps_n1n2ps,0)
    

        def dummy(row,cdfPs_n1n2ps,unicountvals_1_d,unicountvals_2_d):
            '''
            when applied to dataframe, generates 'p-value'-like (null hypo) probability, 1-P(s>0|n1_i,n2_i), where i is the row (i.e. the clone)
            '''
            return cdfPs_n1n2ps[np.argmin(np.fabs(svec)),row['Clone_count_1']==unicountvals_1_d,row['Clone_count_2']==unicountvals_2_d][0]
        dummy_part=partial(dummy,cdfPs_n1n2ps=cdfPs_n1n2ps,unicountvals_1_d=unicountvals_1_d,unicountvals_2_d=unicountvals_2_d)
    
        cdflabel=r'$1-P(s>0)$'
        subset[cdflabel]=subset.apply(dummy_part, axis=1)
        subset=subset[subset[cdflabel]<pthresh].reset_index(drop=True)

        #go from clone count pair (n1,n2) to index in unicountvals_1_d and unicountvals_2_d
        data_pairs_ind_1=np.zeros((len(subset),),dtype=int)
        data_pairs_ind_2=np.zeros((len(subset),),dtype=int)
        for it in range(len(subset)):
            #take the scalar index explicitly; assigning a length-1 array to an element is deprecated in recent NumPy
            data_pairs_ind_1[it]=np.where(int(subset.iloc[it].Clone_count_1)==unicountvals_1_d)[0][0]
            data_pairs_ind_2[it]=np.where(int(subset.iloc[it].Clone_count_2)==unicountvals_2_d)[0][0]   
        #posteriors over data clones
        Ps_n1n2ps_datpairs=Ps_n1n2ps[:,data_pairs_ind_1,data_pairs_ind_2]
    
        #compute posterior metrics
        mean_est=np.zeros((len(subset),))
        max_est= np.zeros((len(subset),))
        slowvec= np.zeros((len(subset),))
        smedvec= np.zeros((len(subset),))
        shighvec=np.zeros((len(subset),))
        pval=0.025 #double-sided comparison statistical test
        pvalvec=[pval,0.5,1-pval] #bound criteria defining slow, smed, and shigh, respectively
        for it,column in enumerate(np.transpose(Ps_n1n2ps_datpairs)):
            mean_est[it]=np.sum(svec*column)
            max_est[it]=svec[np.argmax(column)]
            forwardcmf=np.cumsum(column)
            backwardcmf=np.cumsum(column[::-1])[::-1]
            inds=np.where((forwardcmf[:-1]<pvalvec[0]) & (forwardcmf[1:]>=pvalvec[0]))[0]
            slowvec[it]=np.mean(svec[inds+np.ones((len(inds),),dtype=int)])  #use mean in case there are two values
            inds=np.where((forwardcmf>=pvalvec[1]) & (backwardcmf>=pvalvec[1]))[0]
            smedvec[it]=np.mean(svec[inds])
            inds=np.where((forwardcmf[:-1]<pvalvec[2]) & (forwardcmf[1:]>=pvalvec[2]))[0]
            shighvec[it]=np.mean(svec[inds+np.ones((len(inds),),dtype=int)])
    
        colnames=(r'$\bar{s}$',r'$s_{max}$',r'$s_{3,high}$',r'$s_{2,med}$',r'$s_{1,low}$')
        for it,coldata in enumerate((mean_est,max_est,shighvec,smedvec,slowvec)):
            subset.insert(0,colnames[it],coldata)
        oldcolnames=( 'AACDR3',  'ntCDR3', 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1', 'Clone_fraction_2')
        newcolnames=('CDR3_AA', 'CDR3_nt',        r'$n_1$',        r'$n_2$',           r'$f_1$',           r'$f_2$')
        subset=subset.rename(columns=dict(zip(oldcolnames, newcolnames)))
    
        #select only clones whose posterior median pass the given threshold
        subset=subset[subset[r'$s_{2,med}$']>smedthresh]
    
        print("writing to: "+outpath)
        if print_expanded:
            subset=subset.sort_values(by=cdflabel,ascending=True)
            strout='expanded'
        else:
            subset=subset.sort_values(by=cdflabel,ascending=False)
            strout='contracted'
        subset.to_csv(outpath+'top_'+strout+'.csv',sep='\t',index=False)



    def expansion_table(self, outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):

        r'''
        generate the table of clones that have been significantly detected to be responsive to an acute stimulus.    
    
        Parameters
        ----------
        outpath  : str
            Path prefix of the output table; the file name 'top_expanded.csv' is appended to it,
            so a directory name should end with a path separator
        paras_1  : numpy array
            parameters of the noise model that has been learned at time_1
        paras_2  : numpy array
            parameters of the noise model that has been learned at time_2
        df       : pandas dataframe 
            pandas dataframe merging the two RepSeq data at time_1 and time_2
        noise_model : int
            choice of noise model 0: negative binomial + Poisson, 1: negative binomial, 2: Poisson  
        pval_threshold : float
            threshold on the 'p-value'-like probability 1-P(s>0) below which a TCR clone is considered expanded 
        smed_threshold : float
            threshold on the posterior median of the log-fold change above which a TCR clone is kept 
        Returns
        -------
        None
            the output is written to a tab-separated csv file with columns : $s_{1,low}$, $s_{2,med}$, $s_{3,high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and $1-P(s>0)$
        '''

        sparse_rep = self.get_sparserep(df)
        L_surface, Pn1n2_s_d, Pn0n0_s_d, svec = self._learning_dynamics_expansion(sparse_rep, paras_1, paras_2, noise_model)
        npoints= 50 # same as in _learning_dynamics_expansion
        smax = 25.0     
        s_step = 0.1
        alpvec = np.logspace(-3,np.log10(0.99), npoints)
        sbarvec = np.linspace(0.01,5, npoints)
        maxinds=np.unravel_index(np.argmax(L_surface),np.shape(L_surface))
        optsbar=sbarvec[maxinds[0]]
        optalp=alpvec[maxinds[1]]
        optPs= self._get_Ps(optalp,optsbar,smax,s_step)
        pval_expanded = True

        indn1,indn2,sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep

        self._save_table(outpath, svec, optPs, Pn1n2_s_d, Pn0n0_s_d,  df, unicountvals_1, unicountvals_2, indn1, indn2, pval_expanded, pval_threshold, smed_threshold)


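The table written by `_save_table` rests on the posterior P(s | n1, n2), which is proportional to P(n1, n2 | s) P(s), and on the null probability 1 - P(s>0 | n1, n2) read from its cumulative sum at s=0. The sketch below reproduces this for a single clone with made-up inputs. `posterior_summary`, the Gaussian-shaped toy likelihood and the flat toy prior are assumptions for illustration only, and the median is taken simply as the first grid point where the cumulative sum reaches 0.5, which is a simplification of the forward/backward rule used above.

```python
import numpy as np

def posterior_summary(svec, prior, likelihood):
    """Hypothetical helper: posterior over s for one clone, plus two summaries."""
    post = prior * likelihood
    post = post / post.sum()                     # normalise over the s grid
    cdf = np.cumsum(post)
    # P(s <= 0 | n1, n2) = 1 - P(s > 0 | n1, n2), read off at the grid point closest to 0
    null_prob = cdf[np.argmin(np.abs(svec))]
    # simplified posterior median: first grid point where the cdf reaches 0.5
    median = svec[np.searchsorted(cdf, 0.5)]
    return post, null_prob, median

# toy inputs: flat prior, likelihood peaked at a log fold change of 1
svec = np.linspace(-2.0, 2.0, 41)
prior = np.full(svec.shape, 1.0 / len(svec))
likelihood = np.exp(-0.5 * ((svec - 1.0) / 0.25) ** 2)

post, null_prob, median = posterior_summary(svec, prior, likelihood)
# a clone would pass the filters if null_prob < pval_threshold and median > smed_threshold
```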
#============================================Generate Synthetic Data =============================================================

class Generator:

    """
    A class used to generate in-silico (synthetic) RepSeq samples, either as replicates taken on
    the same day, or as two samples where the first is generated at an initial time and the second some time later (months, years)
    using the geometric Brownian motion model described in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
    ...
    Methods
    -------
    gen_synthetic_data_Null(paras, noise_model, NreadsI,NreadsII,Nsamp):
        generate in-silico same day RepSeq replicates.
    generate_trajectories(tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):
        generate in-silico t_ime apart RepSeq samples.
    """

    def _get_rhof(self, alpha_rho, fmin, freq_nbins=800, freq_dtype='float64'):

        '''
        generates power law (power is alpha_rho) clone frequency distribution over 
        freq_nbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype.
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax),freq_nbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()  
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst 
        return logrhovec,logfvec

    def _get_distsample(self, pmf,Nsamp,dtype='uint32'):
        '''
        generates Nsamp index samples of dtype (e.g. uint16 handles up to 65535 indices) from discrete probability mass function pmf.
        Handles multi-dimensional domain. N.B. Output is sorted.
        '''
        #assert np.sum(pmf)==1, "cmf not normalized!"
    
        shape = np.shape(pmf)
        sortindex = np.argsort(pmf, axis=None)#uses flattened array
        pmf = pmf.flatten()
        pmf = pmf[sortindex]
        cmf = np.cumsum(pmf)
        choice = np.random.uniform(high = cmf[-1], size = int(float(Nsamp)))
        index = np.searchsorted(cmf, choice)
        index = sortindex[index]
        index = np.unravel_index(index, shape)
        index = np.transpose(np.vstack(index))
        sampled_inds = np.array(index[np.argsort(index[:,0])],dtype=dtype)
        return sampled_inds

2342    
2343    def gen_synthetic_data_Null(self, paras, noise_model, NreadsI,NreadsII,Nsamp):
        '''
        Outputs an array of clone frequencies of the sampled clones and the corresponding dataframe of pair counts
        for a null model learned from a dataset pair with NreadsI and NreadsII reads, respectively.
        For RAM efficiency, sampling is conditioned on the clone being observed, i.e. on falling in one of the three
        conditions (n,0), (0,n'), or n,n'>0, so that only Nsamp clones need to be sampled rather than the N clones in the repertoire.
        Note that no explicit normalization is applied. It is assumed that the values in paras are consistent with N<f>=1
        (e.g. were obtained through the learning done in this package).
        '''
2352
2353    
2354        alpha = paras[0] #power law exponent
2355        fmin=np.power(10,paras[-1])
2356        if noise_model<1:
2357            m_total=float(np.power(10, paras[3])) 
2358            r_c1=NreadsI/m_total
2359            r_c2=NreadsII/m_total
2360            r_cvec=[r_c1,r_c2]
2361        if noise_model<2:
2362            beta_mv= paras[1]
2363            alpha_mv=paras[2]
2364    
2365        logrhofvec,logfvec = self.get_rhof(alpha,fmin)
2366        fvec=np.exp(logfvec)
2367        dlogf=np.diff(logfvec)/2.
2368    
2369        #generate measurement model distribution, Pn_f
2370        Pn_f=np.empty((len(logfvec),),dtype=object) #len(logfvec) samplers
2371    
2372        #get value at n=0 to use for conditioning on n>0 (and get full Pn_f here if noise_model=1,2)
2373        m_max=1e3 #conditioned on n=0, so no edge effects
2374    
2375        Nreadsvec=(NreadsI,NreadsII)
2376        for it in range(2):
2377            Pn_f=np.empty((len(fvec),),dtype=object)
2378            if noise_model==2:
2379                m1vec=Nreadsvec[it]*fvec
2380                for find,m1 in enumerate(m1vec):
2381                    Pn_f[find]=poisson(m1)
2382                logPn0_f=-m1vec
2383            elif noise_model==1:
2384                m1=Nreadsvec[it]*fvec
2385                v1=m1+beta_mv*np.power(m1,alpha_mv)
2386                p=1-m1/v1
2387                n=m1*m1/v1/p
2388                for find,(n,p) in enumerate(zip(n,p)):
2389                    Pn_f[find]=nbinom(n,1-p)
2390                Pn0_f=np.asarray([Pn_find.pmf(0) for Pn_find in Pn_f])
2391                logPn0_f=np.log(Pn0_f)
2392            
2393            elif noise_model==0:
2394                m1=m_total*fvec
2395                v1=m1+beta_mv*np.power(m1,alpha_mv)
2396                p=1-m1/v1
2397                n=m1*m1/v1/p
2398                Pn0_f=np.zeros((len(fvec),))
2399                for find in range(len(Pn0_f)):
2400                    nbtmp=nbinom(n[find],1-p[find]).pmf(np.arange(m_max+1))
2401                    ptmp=poisson(r_cvec[it]*np.arange(m_max+1)).pmf(0)
2402                    Pn0_f[find]=np.sum(np.exp(np.log(nbtmp)+np.log(ptmp)))
2403                logPn0_f=np.log(Pn0_f)
            else:
                raise ValueError('noise_model must be 0, 1, or 2')
2406            
2407            if it==0:
2408                Pn1_f=Pn_f
2409                logPn10_f=logPn0_f
2410            else:
2411                Pn2_f=Pn_f
2412                logPn20_f=logPn0_f
2413
2414        #3-quadrant q|f conditional distribution (qx0:n1>0,n2=0;q0x:n1=0,n2>0;qxx:n1,n2>0)
2415        logPqx0_f=np.log(1-np.exp(logPn10_f))+logPn20_f
2416        logPq0x_f=logPn10_f+np.log(1-np.exp(logPn20_f))
2417        logPqxx_f=np.log(1-np.exp(logPn10_f))+np.log(1-np.exp(logPn20_f))
2418        #3-quadrant q,f joint distribution
2419        logPfqx0=logPqx0_f+logrhofvec
2420        logPfq0x=logPq0x_f+logrhofvec
2421        logPfqxx=logPqxx_f+logrhofvec
2422        #3-quadrant q marginal distribution 
2423        Pqx0=np.trapz(np.exp(logPfqx0+logfvec),x=logfvec)
2424        Pq0x=np.trapz(np.exp(logPfq0x+logfvec),x=logfvec)
2425        Pqxx=np.trapz(np.exp(logPfqxx+logfvec),x=logfvec)
2426    
2427        #3 quadrant conditional f|q distribution
2428        Pf_qx0=np.where(Pqx0>0,np.exp(logPfqx0-np.log(Pqx0)),0)
2429        Pf_q0x=np.where(Pq0x>0,np.exp(logPfq0x-np.log(Pq0x)),0)
2430        Pf_qxx=np.where(Pqxx>0,np.exp(logPfqxx-np.log(Pqxx)),0)
2431    
2432        #3-quadrant q marginal distribution
2433        newPqZ=Pqx0 + Pq0x + Pqxx
2434        Pqx0/=newPqZ
2435        Pq0x/=newPqZ
2436        Pqxx/=newPqZ
2437
2438        Pfqx0=np.exp(logPfqx0)
2439        Pfq0x=np.exp(logPfq0x)
2440        Pfqxx=np.exp(logPfqxx)
2441    
2442        print('Model probs: '+str(Pqx0)+' '+str(Pq0x)+' '+str(Pqxx))
2443
2444        #get samples 
2445        num_samples=Nsamp
2446        q_samples=np.random.choice(range(3), num_samples, p=(Pqx0,Pq0x,Pqxx))
        #bincount with minlength keeps a zero entry for any quadrant that was never sampled
        counts=np.bincount(q_samples,minlength=3)
        num_qx0=counts[0]
        num_q0x=counts[1]
        num_qxx=counts[2]
2451        print('q samples: '+str(sum(counts))+' '+str(num_qx0)+' '+str(num_q0x)+' '+str(num_qxx))
2452        print('q sampled probs: '+str(num_qx0/float(sum(counts)))+' '+str(num_q0x/float(sum(counts)))+' '+str(num_qxx/float(sum(counts))))
2453    
2454        #x0
2455        integ=np.exp(np.log(Pf_qx0)+logfvec)
2456        f_samples_inds= self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qx0).flatten()
2457        f_sorted_inds=np.argsort(f_samples_inds)
2458        f_samples_inds=f_samples_inds[f_sorted_inds] 
2459        qx0_f_samples=fvec[f_samples_inds]
2460        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
2461        qx0_samples=np.zeros((num_qx0,))
2462        if noise_model<1:
2463            qx0_m_samples=np.zeros((num_qx0,))
            #Conditioning on n>0 applies an m-dependent factor to Pm_f, which cannot be incorporated into the ppf method used for noise_model 1 and 2.
            #We handle that here with a custom finite-range sampler, which has the drawback of requiring an upper limit.
            #This works as long as n_max/r_c is much smaller than that upper limit, so it depends on the highest counts in the data (n_max). My data had max counts of 1e3-1e4.
            #Alternatively, one could define a custom scipy RV class by defining its PMF, but it has to be array-compatible, which requires care.
2468            m_samp_max=int(1e5) 
2469            mvec=np.arange(m_samp_max)   
2470    
2471        for it,find in enumerate(find_vals):
2472            if noise_model==0:      
2473                m1=m_total*fvec[find]
2474                v1=m1+beta_mv*np.power(m1,alpha_mv)
2475                p=1-m1/v1
2476                n=m1*m1/v1/p
2477                Pm1_f=nbinom(n,1-p)
2478            
2479                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
2480                Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
2481                qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])
2482            
2483                mvals,minds,m_counts=np.unique(qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
2484                for mit,m in enumerate(mvals):
2485                    Pn1_m1=poisson(r_c1*m)
2486                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
2487                    qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)
2488 
2489        
2490            elif noise_model>0:
2491                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
2492                qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
2493            else:
                print('noise_model is 0, 1, or 2 only')
2495        qx0_pair_samples=np.hstack((qx0_samples[:,np.newaxis],np.zeros((num_qx0,1)))) 
2496    
2497        #0x
2498        integ=np.exp(np.log(Pf_q0x)+logfvec)
2499        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_q0x).flatten()
2500        f_sorted_inds=np.argsort(f_samples_inds)
2501        f_samples_inds=f_samples_inds[f_sorted_inds] 
2502        q0x_f_samples=fvec[f_samples_inds]
2503        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
2504        q0x_samples=np.zeros((num_q0x,))
2505        if noise_model<1:
2506            q0x_m_samples=np.zeros((num_q0x,))
2507        for it,find in enumerate(find_vals):
2508            if noise_model==0:
2509                m2=m_total*fvec[find]
2510                v2=m2+beta_mv*np.power(m2,alpha_mv)
2511                p=1-m2/v2
2512                n=m2*m2/v2/p
2513                Pm2_f=nbinom(n,1-p)
2514            
2515                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
2516                Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
2517                q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])
2518
2519                mvals,minds,m_counts=np.unique(q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
2520                for mit,m in enumerate(mvals):
2521                    Pn2_m2=poisson(r_c2*m)
2522                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
2523                    q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)
2524        
2525                       
2526        
2527            elif noise_model > 0:
2528                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
2529                q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
2530            else:
                print('noise_model is 0, 1, or 2 only')
2532        q0x_pair_samples=np.hstack((np.zeros((num_q0x,1)),q0x_samples[:,np.newaxis]))
2533    
2534        #qxx
2535        integ=np.exp(np.log(Pf_qxx)+logfvec)
2536        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qxx).flatten()        
2537        f_sorted_inds=np.argsort(f_samples_inds)
2538        f_samples_inds=f_samples_inds[f_sorted_inds] 
2539        qxx_f_samples=fvec[f_samples_inds]
2540        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
2541        qxx_n1_samples=np.zeros((num_qxx,))
2542        qxx_n2_samples=np.zeros((num_qxx,))
2543        if noise_model<1:
2544            qxx_m1_samples=np.zeros((num_qxx,))
2545            qxx_m2_samples=np.zeros((num_qxx,))
2546        for it,find in enumerate(find_vals):
2547            if noise_model==0:
2548                m1=m_total*fvec[find]
2549                v1=m1+beta_mv*np.power(m1,alpha_mv)
2550                p=1-m1/v1
2551                n=m1*m1/v1/p
2552                Pm1_f=nbinom(n,1-p)
2553            
2554                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
2555                if np.sum(Pm1_f_adj)==0:
2556                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
2557                else:
2558                    Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
2559                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])
2560
2561                mvals,minds,m_counts=np.unique(qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
2562                for mit,m in enumerate(mvals):
2563                    Pn1_m1=poisson(r_c1*m)
2564                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
2565                    qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)
2566                
2567                m2=m_total*fvec[find]
2568                v2=m2+beta_mv*np.power(m2,alpha_mv)
2569                p=1-m2/v2
2570                n=m2*m2/v2/p
2571                Pm2_f=nbinom(n,1-p)
2572            
2573                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                if np.sum(Pm2_f_adj)==0:
2575                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
2576                else:
2577                    Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
2578                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])
2579
2580                mvals,minds,m_counts=np.unique(qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
2581                for mit,m in enumerate(mvals):
2582                    Pn2_m2=poisson(r_c2*m)
2583                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
2584                    qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)    
2585
2586                          
2587            elif noise_model>0:
2588                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
2589                qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
2590                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
2591                qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
2592            else:
                print('noise_model is 0, 1, or 2 only')
2594            
2595        qxx_pair_samples=np.hstack((qxx_n1_samples[:,np.newaxis],qxx_n2_samples[:,np.newaxis]))
2596    
2597        pair_samples=np.vstack((q0x_pair_samples,qx0_pair_samples,qxx_pair_samples))
2598        f_samples=np.concatenate((q0x_f_samples,qx0_f_samples,qxx_f_samples))
2599        output_m_samples=False
2600        if noise_model<1 and output_m_samples:                
2601            m1_samples=np.concatenate((q0x_m1_samples,qx0_m1_samples,qxx_m1_samples))
2602            m2_samples=np.concatenate((q0x_m2_samples,qx0_m2_samples,qxx_m2_samples))
2603    
2604        pair_samples_df=pd.DataFrame({'Clone_count_1':pair_samples[:,0],'Clone_count_2':pair_samples[:,1]})
2605
2606        pair_samples_df['Clone_fraction_1'] = pair_samples_df['Clone_count_1']/np.sum(pair_samples_df['Clone_count_1'])
2607        pair_samples_df['Clone_fraction_2'] = pair_samples_df['Clone_count_2']/np.sum(pair_samples_df['Clone_count_2'])
2608    
2609        return f_samples,pair_samples_df
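
For noise models 1 and 2, the method above conditions counts on n>0 by drawing uniforms on the interval [cdf(0), 1) and passing them through the ppf. A minimal sketch of that trick with a Poisson distribution follows; the helper name `sample_zero_truncated` and the rate 0.5 are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

def sample_zero_truncated(dist, size, rng):
    """Sample from a frozen discrete scipy distribution conditioned on n > 0.

    Hypothetical helper for illustration only.
    """
    p0 = dist.cdf(0)
    # Rescale uniforms from [0, 1) to [P(n=0), 1): the n=0 mass is skipped
    u = rng.random(size) * (1.0 - p0) + p0
    # The inverse CDF then only returns values n >= 1
    return dist.ppf(u).astype(int)

rng = np.random.default_rng(1)
counts = sample_zero_truncated(poisson(0.5), 10000, rng)
```

This avoids rejection sampling, which would be wasteful for small clones, where most draws would be zero.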
2610
2611
2612    def generate_trajectories(self, tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):
2613
2614
        """
        Generate two in-silico RepSeq samples taken t_ime years apart and write them to a file.
        
        Parameters
        ----------
        tau      : float
            first time-scale parameter of the dynamics
        theta    : float
            second time-scale parameter of the dynamics
        method   : str
            'negative_binomial' or 'poisson'
        paras_1  : numpy array
            parameters of the noise model that has been learnt at time_1
        paras_2  : numpy array
            parameters of the noise model that has been learnt at time_2
        t_ime    : float
            number of years between the two synthetic samplings (between time_1 and time_2)
        filename : str
            name (without extension) of the file in which the dataframe is stored; '.csv' is appended
        NreadsI  : str or float
            number of reads of the first sample (converted with float()), default '1e6'
        NreadsII : str or float
            number of reads of the second sample (converted with float()), default '1e6'

        Returns
        -------
        None
            the output is written to filename + '.csv' as a tab-separated table with columns 'Clone_count_1' (at time_1), 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'
        """
2639
2640        np.seterr(divide = 'ignore') 
        import warnings  # np.warnings was removed in NumPy 1.24
        warnings.filterwarnings('ignore')
2642
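        # Validate the requested sampling model
        if method not in ('negative_binomial', 'poisson'):
            raise ValueError("method must be 'negative_binomial' or 'poisson'")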
2644
2645
2646        # Synthetic data generation
2647
2648        print('execution starting...')
2649
2650        st = time.time()
2651
2652        #Values of the parameters
2653        A = -1/tau
2654        B = 1/theta
2655        N_0 = 40
2656        NreadsI = float(NreadsI)
2657        NreadsII = float(NreadsII)
2658
2659        t = float(t_ime)
2660
2661        if NreadsI == NreadsII:
2662            key_sym = '_sym_'
2663
2664        else:
2665            key_sym = '_asym_'
2666
2667        # Name of the directory
2668
2669
2670        dirName = 'output'    
2671        os.makedirs(dirName, exist_ok=True) 
2672
        paras = paras_1 # parameters used for sampling, e.g. a and b of the negative binomial distribution [0.7, 1.1]
2674        alpha = -1 +2*A/B
2675        #print('alpha : ' + str(alpha))
2676
2677        #1/ Generate log-population at initial time from steady-state distribution + GBM diffusion trajectories for 2 years
2678        x_i_LB, x_f_LB, Prop_Matrix_LB, p_ext_LB, results_extinction_LB, time_vec_LB, results_extinction_source_LB, x_source_LB = _generator_diffusion_LB(A, B, N_0, t)
2679        
2680        #x_i_LB, x_f_LB, Prop_Matrix, p_ext, results_extinction  = generator_diffusion_LB(B, A, N_0, t)
2681        N_cells_day_0_LB, N_cells_day_1_LB = np.sum(np.exp(x_i_LB)), np.sum(np.exp(x_f_LB)) + np.sum(np.exp(x_source_LB))  #N_cells_final_LB
2682        print('NUMBER OF CELLS AT INITIAL TIME')
2683        print(N_cells_day_0_LB)
2684
2685        print('NUMBER OF CELLS AT FINAL TIME')
2686        print(N_cells_day_1_LB)
2687
2688        #print('SHAPE_X_I ' +  str(np.shape(x_i_LB)))
2689        #print('SHAPE_X_F ' +  str(np.shape(x_f_LB)))
2690
2691
2692        if method == 'negative_binomial':
2693
2694            df_diffusion_LB  = _experimental_sampling_diffusion_NegBin(NreadsI, NreadsII, paras, x_i_LB, x_f_LB, N_cells_day_0_LB, N_cells_day_1_LB)
2695            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')
2696
2697        elif method == 'poisson': 
2698
2699            df_diffusion_LB  = _experimental_sampling_diffusion_Poisson(NreadsI, NreadsII, x_i_LB, x_f_LB, t, N_cells_day_0_LB, N_cells_day_1_LB)
2700            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')
2701
2702   
class longitudinal_analysis:
 692class longitudinal_analysis():
 693
    """
    This class provides tools to inspect longitudinal data associated with one individual and to
    compute some simple statistics on them (it is independent of the NoisET software).
    
    ...
    Attributes
    ----------
    clone_count_label : str
        label in the clonotype tables indicating the clonotype count
    seq_label : str
        label in the clonotype tables indicating the sequence of the receptor
    clones : dict of pandas.DataFrame
        dictionary containing the clonotype tables as pandas frames. The keys are 
        strings "patient_time"; replicates are merged. Created in the initialization
    times : numpy.ndarray of int
        ordered times of the imported tables. Created in the initialization
    unique_clones : numpy.ndarray of str
        all the unique clonotype sequences in all the time points
    time_occurrence : numpy.ndarray of int
        number of time points in which each clonotype appears. The index
        refers to the clonotype in unique_clones
    Methods
    -------
    compute_clone_time_occurrence()
        It creates two new attributes: the list of unique clonotypes in the whole dataset 
        "self.unique_clones" and the time occurrence of each of them "self.time_occurrence".
        The time occurrence is the number of time points in which the clone appears.
    plot_hist_persistence(figsize=(12,10))
        It plots the distribution of the time occurrence of the unique clonotypes
    top_clones_set(n_top_clones)
        Compute the set of top clones as the union of the "n_top_clones" most abundant
        clonotypes at each time point
    build_traj_frame(clone_set)
        Build a dataframe containing the frequency at all the time points for each of the
        clonotypes in "clone_set", together with their cumulative frequency
    plot_trajectories(n_top_clones, colormap='viridis', figsize=(12,10))
        Function to plot the trajectories of the first "n_top_clones". Colors of the
        trajectories represent the cumulative frequency in all the time points.
    PCA_traj(n_top_clones, nclus=4)
        Perform PCA over the normalized trajectories of n_top_clones TCR clones.
        The normalization consists in dividing the whole trajectory by its maximum value.
        After PCA the trajectories are clustered in the space of the two principal components
        with a hierarchical clustering algorithm.
    plot_PCA2(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10))
        Plotting the trajectories in the space of their two principal components and
        clustering them as in "PCA_traj".
    plot_PCA_clusters_traj(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10))
        Plotting the trajectories grouped by PCA clusters
    """
 743
 744
 745
 746
 747    def __init__(self, patient, data_folder, sequence_label='N. Seq. CDR3', clone_count_label='Clone count',
 748                 replicate1_label='_F1', replicate2_label='_F2', separator='\t'):
        """ 
        Import all the clonotypes of a given patient and store them in the dictionary "self.clones".
        It also creates the list of times "self.times". During this process the replicates at the
        same time point are merged together.
        The names of the tables containing TCRs should be structured as "patient_time_replicate.csv".
        Those tables should be csv files compressed in a zip archive (see the example notebook).
        Parameters
        ----------
        patient : str
            The ID of the patient
        data_folder : str
            folder name containing the csv files listing the T-cell receptors
        sequence_label : str
            label of the column containing the receptor sequence
        clone_count_label : str
            label of the column containing the clonotype count
        replicate1_label : str
            suffix identifying the first replicate in the file names
        replicate2_label : str
            suffix identifying the second replicate in the file names
        separator : str
            separator symbol in the csv tables
        """
 764        
 765        self.clone_count_label = clone_count_label
 766        self.seq_label = sequence_label
 767        self.unique_clones = None
 768        self.time_occurrence = None
 769        self.times = []
 770        clones_repl = dict()
 771
 772        # Iteration over all the file in the folder for importing each table
 773        for file_name in os.listdir(data_folder):
 774        # If the name before the underscore corresponds to the chosen patient..
 775            if file_name.split('_')[0] == patient:
 776                # Import the table
                frame = pd.read_csv(os.path.join(data_folder, file_name), sep=separator, compression=dict(method='zip'))
 778                # Store it in a dictionary where the key contains the patient, the time
 779                # and the replicate.
 780                clones_repl[file_name[:-10]] = frame
 781                # Reading the time from the name and storing it
 782                self.times.append(int(file_name.split('_')[1]))
 783                print('Clonotypes',file_name[:-10],'imported')
 784
 785        # Sorting the unique times
 786        self.times = np.sort(list(set(self.times)))
 787        self.clones = self._merge_replicates(patient, clones_repl, replicate1_label, replicate2_label)
 788        
 789
 790    def _merge_replicates(self, patient, clones_repl, repl1_label, repl2_label):
 791        
 792        clones_merged = dict()
 793
 794        # Iteration over the times
 795        for it, t in enumerate(self.times):
            # Building the ids corresponding to the 1st and 2nd replicate at the given time point
            id_F1 = patient + '_' + str(t) + repl1_label
            id_F2 = patient + '_' + str(t) + repl2_label
            # Below all the rows of one table are appended to the rows of the other.
            # pd.concat is used because an outer merge on all shared columns would collapse
            # rows that are identical in both replicates before their counts can be summed
            merged_replicates = pd.concat([clones_repl[id_F1], clones_repl[id_F2]], ignore_index=True)
            # But there are common clonotypes that now appear in two different rows 
            # (one for the first and one for the second replicate)! 
            # Below we collapse those common sequences and the counts of the two are summed 
            merged_replicates = merged_replicates.groupby(self.seq_label, as_index=False).agg({self.clone_count_label:'sum'})
 805            depth = merged_replicates[self.clone_count_label].sum()
 806            merged_replicates['Clone freq'] = merged_replicates[self.clone_count_label] / depth
 807            merged_replicates = merged_replicates.sort_values('Clone freq', ascending=False)
 808            # The merged table is then added to the dictionary
 809            clones_merged[patient + '_' + str(t)] = merged_replicates
 810
 811        return clones_merged
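
The replicate-merging step can be tried on toy data. The sketch below is an illustration, not the class method itself: the two small tables are made up, and the rows are stacked with `pd.concat`. Note that `rep1.merge(rep2, how='outer')` joins on every shared column, so the row ('AAA', 3), which is identical in both replicates, would be kept only once and its count would not be doubled.

```python
import pandas as pd

# Two made-up replicate tables using the default column labels of the class
rep1 = pd.DataFrame({'N. Seq. CDR3': ['AAA', 'CCC'], 'Clone count': [3, 1]})
rep2 = pd.DataFrame({'N. Seq. CDR3': ['AAA', 'GGG'], 'Clone count': [3, 2]})

# Stack the rows of both replicates
stacked = pd.concat([rep1, rep2], ignore_index=True)
# Collapse clonotypes shared by the replicates and sum their counts
merged = stacked.groupby('N. Seq. CDR3', as_index=False).agg({'Clone count': 'sum'})
# Frequencies are computed on the merged sequencing depth
merged['Clone freq'] = merged['Clone count'] / merged['Clone count'].sum()
merged = merged.sort_values('Clone freq', ascending=False).reset_index(drop=True)
```

After merging, each clonotype appears once per time point, which is what the time-occurrence count relies on.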
 812
 813    
 814    def compute_clone_time_occurrence(self):
 815
        """
        It creates two new attributes: the list of unique clonotypes in the whole dataset 
        "self.unique_clones" and the time occurrence of each of them "self.time_occurrence".
        The time occurrence is the number of time points in which the clone appears.
        """
 821
 822        all_clones = np.array([])
 823        for id_, cl in self.clones.items():
 824            all_clones = np.append(all_clones, cl[self.seq_label].values)
 825
 826        # The following function returns the list of unique clonotypes and the number of
 827        # repetitions for each of them. 
 828        # Note that the number of repetitions is exactly the time occurrence
 829        self.unique_clones, self.time_occurrence = np.unique(all_clones, return_counts=True)
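
The counting above works because each clonotype appears at most once per merged table, so the number of repetitions across tables equals the number of time points in which it is seen. A toy illustration with made-up sequences and keys:

```python
import numpy as np

# Made-up clonotype lists keyed by "patient_time"
clones_by_time = {
    'P1_0': ['AAA', 'CCC'],
    'P1_15': ['AAA', 'GGG'],
    'P1_45': ['AAA', 'CCC', 'TTT'],
}

# Pool the sequences of all time points into one array
all_clones = np.concatenate([np.asarray(seqs) for seqs in clones_by_time.values()])
# Unique sequences and how many time points each one appears in
unique_clones, time_occurrence = np.unique(all_clones, return_counts=True)
occurrence = dict(zip(unique_clones.tolist(), time_occurrence.tolist()))
```

Here 'AAA' persists across all three time points, while 'GGG' and 'TTT' are seen only once.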
 830
 831
 832    def plot_hist_persistence(self, figsize=(12,10)):
 833
        """
        It plots the distribution of the time occurrence of the unique clonotypes.
        Parameters
        ----------
        figsize : tuple
            width, height in inches
        
        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes._subplots.AxesSubplot
            axes where to draw the plot
        """
 848
 849        if type(self.unique_clones) != np.ndarray:
 850            self.compute_clone_time_occurrence()
 851            
 852        fig, ax = plt.subplots(figsize=figsize)
 853
 854        plt.rc('xtick', labelsize = 30)
 855        plt.rc('ytick', labelsize = 30)
 856
 857        ax.set_yscale('log')
 858        ax.set_xlabel('Time occurrence', fontsize = 30)
 859        ax.set_ylabel('Counts', fontsize = 30)
 860        ax.hist(self.time_occurrence, bins=np.arange(1,len(self.times)+2)-0.5, rwidth=0.6)
 861        
 862        return fig, ax
 863        
 864
 865    def top_clones_set(self, n_top_clones):
 866        
        """ 
        Compute the set of top clones as the union of the "n_top_clones" most abundant
        clonotypes at each time point.
        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point
        Returns
        -------
        top_clones : set of str
            set of top clones
        """
 879
 880        top_clones = set()
 881        for id_, cl in self.clones.items():
 882            top_clones_at_time = cl.sort_values(self.clone_count_label, ascending=False)[:n_top_clones]
 883            top_clones = top_clones.union(top_clones_at_time[self.seq_label].values)
 884        return top_clones
 885    
 886
 887    def build_traj_frame(self, clone_set):
 888        
        """ 
        This builds a dataframe containing the frequency at all the time points for each 
        of the clonotypes specified in clone_set.
        The dataframe also has a field that contains the cumulative frequency.
        Parameters
        ----------
        clone_set : set of str
            set of clonotypes whose temporal trajectory is built
        Returns
        -------
        traj_frame : pandas.DataFrame
            dataframe containing the frequency at all the time points
        """
 902
        # Lists are passed to pandas instead of sets, since set indexers are not accepted by .loc in pandas 2.0 and later
        clone_set = set(clone_set)
        traj_frame = pd.DataFrame(index=list(clone_set))
        traj_frame['Clone cumul freq'] = 0

        for id_, cl in self.clones.items(): 

            # Getting the time from the index of clones_merged
            t = id_.split('_')[1]
            # Selecting the clonotypes that are both in the frame at the given time 
            # point and in the list of top_clones_set
            top_clones_at_time = clone_set.intersection(set(cl[self.seq_label]))
            # Creating a sub-dataframe containing only the clone in top_clones_at_time
            clones_at_time = cl.set_index(self.seq_label).loc[list(top_clones_at_time)]
 915            # Creating a new column in the trajectory frames for the counts at that time
 916            traj_frame['t'+str(t)] = traj_frame.index.map(clones_at_time['Clone freq'].to_dict())
 917            # The clonotypes not present at that time are NaN. Below we convert NaN in 0s
 918            traj_frame = traj_frame.fillna(0)
 919            # The cumulative count for each clonotype is updated
 920            traj_frame['Clone cumul freq'] += traj_frame['t'+str(t)]
 921        
 922        return traj_frame 
 923
 924
 925
 926    # Plot clonal trajectories
 927
 928
 929    def plot_trajectories(self, n_top_clones, colormap='viridis', figsize=(12,10)):
 930
        """
        Function to plot the trajectories of the first "n_top_clones". Colors of the
        trajectories represent the cumulative frequency in all the time points.
        
        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point
        colormap  : str
            name of the matplotlib colormap used to color the trajectories
        figsize : tuple
            width, height in inches
        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes._subplots.AxesSubplot
            axes where to draw the plot
        """
 951
        cmap = plt.get_cmap(colormap)  # cm.get_cmap was removed in Matplotlib 3.9
 953        top_clones = self.top_clones_set(n_top_clones)
 954        traj_frame = self.build_traj_frame(top_clones)
 955        
 956        fig, ax = plt.subplots(figsize=figsize)
 957        plt.rc('xtick', labelsize = 30)
 958        plt.rc('ytick', labelsize = 30)
 959        ax.set_yscale('log')
 960        ax.set_xlabel('time', fontsize = 25)
 961        ax.set_ylabel('frequency', fontsize = 25)
 962
 963        log_counts = np.log10(traj_frame['Clone cumul freq'].values)
 964        max_log_count = max(log_counts)
 965        min_log_count = min(log_counts)
 966
 967        for id_, row in traj_frame.iterrows():
 968            traj = row.drop(['Clone cumul freq']).to_numpy()
 969            log_count = np.log10(row['Clone cumul freq'])
 970            norm_log_count = (log_count-min_log_count)/(max_log_count-min_log_count)
 971            plt.plot(self.times, traj, c=cmap(norm_log_count))
 972
 973
 974        sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=min(log_counts), vmax=max(log_counts)))
        cb = plt.colorbar(sm, ax=ax)
 976        cb.set_label('Log10 cumulative frequency', fontsize = 25)
 977
 978        return fig, ax
 979    
 980
 981    def PCA_traj(self, n_top_clones, nclus=4):
 982
 983        """
 984        Perform PCA over the normalized trajectories of n_top_clones TCR clones.
 985        The normalization consists in dividing the whole trajectory by its maximum value.
 986        After PCA the trajectories are clustered in the two principal componets space
 987        with a hierarchical clustering algorithm.
 988        
 989        Parameters
 990        ----------
 991        n_top_clones : int
 992            number of most abundant clontypes in each time point to consider in the PCA
 993        nclus : float
 994            number of clusters 
 995        
 996        Returns
 997        -------
 998        pca : sklearn.decomposition._pca.PCA
 999            object containing the result of the principal component analysis
1000            
1001        clustering : sklearn.cluster._agglomerative.AgglomerativeClustering
1002            object containing the result of the hierarchical clustering
1003        """
1004
1005        #Getting the top n_top_clones clonotypes at each time point
1006        top_clones = self.top_clones_set(n_top_clones)
1007        #Building a trajectory dataframe
1008        traj_frame = self.build_traj_frame(top_clones)
1009
1010        #Converting it in a numpy matrix
1011        traj_matrix = traj_frame.drop(['Clone cumul freq'], axis = 1).to_numpy()
1012
1013        # Normalize each trajectory by its maximum
1014        norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis]
1015
1016        pca = PCA(n_components =2).fit(norm_traj_matrix.T)
1017        clustering = AgglomerativeClustering(n_clusters = nclus)
1018        clustering = clustering.fit(pca.components_.T)
1019
1020        return pca, clustering
1021
1022
1023    def plot_PCA2(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):
1024
1025        """
1026        Plotting the trajectories in the space of their two principal components and
1027        clustering them as in "PCA_traj".
1028        
1029        Parameters
1030        ----------
1031        n_top_clones : int
1032            number of most abundant clontypes in each time point to consider in the PCA
1033        nclus : float
1034            number of clusters 
1035        colormap : str
1036            colormap indicating the different clusters
1037        figsize : tuple
1038            width, height in inches
1039        Returns
1040        -------
1041        ax : matplotlib.axes._subplots.AxesSubplot
1042            axes where to draw the plot
1043        fig : matplotlib.figure.Figure
1044            matplotlib figure
1045        """
1046
1047
1048        cmap = cm.get_cmap(colormap)
1049        pca, clustering = self.PCA_traj(n_top_clones, nclus)
1050
1051        fig, ax = plt.subplots(figsize=figsize)
1052        ax.set_title('PCA components (%i trajs)' %pca.n_features_, fontsize = 25)
1053        ax.set_xlabel('First component (expl var: %3.2f)'%pca.explained_variance_ratio_[0], fontsize = 25)
1054        ax.set_ylabel('Second component (expl var: %3.2f)'%pca.explained_variance_ratio_[1], fontsize = 25)
1055        for c_ind in range(clustering.n_clusters):
1056            x = pca.components_[0][clustering.labels_ == c_ind]
1057            y = pca.components_[1][clustering.labels_ == c_ind]
1058            ax.scatter(x, y, alpha=0.2, color=cmap(c_ind/clustering.n_clusters))
1059        
1060        return fig, ax
1061    
1062
1063    def plot_PCA_clusters_traj(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):
1064
1065        """
1066        Plotting the trajectories grouped by PCA clusters
1067        
1068        Parameters
1069        ----------
1070        n_top_clones : int
1071            number of most abundant clontypes in each time point to consider in the PCA
1072        nclus : float
1073            number of clusters 
1074        colormap : str
1075            colormap indicating the different clusters
1076        figsize : tuple
1077            width, height in inches
1078        Returns
1079        -------
1080        axs : tuple of matplotlib.axes._subplots.AxesSubplot
1081            axis where to draw the plot
1082        fig : matplotlib.figure.Figure
1083            matplotlib figure
1084        """
1085
1086        cmap = cm.get_cmap(colormap)
1087        pca, clustering = self.PCA_traj(n_top_clones, nclus)
1088
1089        n_cl = clustering.n_clusters
1090
1091        #Getting the top n_top_clones clonotypes at each time point
1092        top_clones = self.top_clones_set(n_top_clones)
1093        #Building a trajectory dataframe
1094        traj_frame = self.build_traj_frame(top_clones)
1095
1096        #Converting it in a numpy matrix
1097        traj_matrix = traj_frame.drop(['Clone cumul freq'], axis=1).to_numpy()
1098
1099        # Normalize each trajectory by its maximum
1100        norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis]
1101
1102        fig, axs = plt.subplots(2, n_cl, figsize=(5*n_cl, 12))
1103        for cl in range(n_cl):
1104            trajs = norm_traj_matrix[clustering.labels_ == cl]
1105            axs[0][cl].set_xlabel('Time', fontsize = 15)
1106            axs[0][cl].set_ylabel('Normalized frequency', fontsize = 15)
1107            axs[1][cl].set_xlabel('Time', fontsize = 15)
1108            axs[1][cl].set_ylabel('Normalized frequency', fontsize = 15)
1109            for traj in trajs:
1110                axs[0][cl].plot(self.times, traj, alpha=0.2, color=cmap(cl/n_cl))
1111            axs[1][cl].set_ylim(0,1)
1112            axs[1][cl].errorbar(self.times, np.mean(trajs, axis=0), 
1113                                yerr=np.std(trajs, axis=0), lw=3, color=cmap(cl/n_cl))
1114            #axs[1][cl].fill_between(times, np.quantile(trajs, 0.75, axis=0), np.quantile(trajs, 0.25, axis=0), color=colors[cl])
1115               
1116        plt.tight_layout()
1117        return fig, axs

This class provides tools to inspect longitudinal data associated with one individual and to compute simple statistics on them (it is independent of the NoisET software).

...

Attributes
  • clone_count_label (str): label in the clonotype tables indicating the clonotype count
  • seq_label (str): label in the clonotype tables indicating the sequence of the receptor
  • clones (dict of pandas.DataFrame): dictionary containing the clonotype tables as pandas frames. The keys are strings "patient_time"; replicates are merged. Created in the initialization
  • times (list of float): ordered times of the imported tables. Created in the initialization
  • unique_clones (list of str): list of all the unique clonotype sequences in all the time points
  • time_occurrence (list of int): number of time points in which each clonotype appears. The index refers to the clonotype in the unique_clones list
Methods

  • compute_clone_time_occurrence(): creates two new attributes, the list of unique clonotypes in the whole dataset "self.unique_clones" and the time occurrence of each of them "self.time_occurrence". The time occurrence is the number of time points in which the clone appears.
  • plot_hist_persistence(figsize=(12,10)): plots the distribution of the time occurrence of the unique clonotypes.
  • top_clones_set(n_top_clones): computes the set of top clones as the union of the "n_top_clones" most abundant clonotypes at each time point.
  • build_traj_frame(clone_set): builds a dataframe containing the frequency at all the time points for each of the clonotypes specified in clone_set, together with their cumulative frequency.
  • plot_trajectories(n_top_clones, colormap='viridis', figsize=(12,10)): plots the trajectories of the top clones. Colors of the trajectories represent the cumulative frequency over all the time points.
  • PCA_traj(n_top_clones, nclus=4): performs PCA over the normalized trajectories of the top clones. The normalization consists in dividing the whole trajectory by its maximum value. After PCA the trajectories are clustered in the space of the two principal components with a hierarchical clustering algorithm.
  • plot_PCA2(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)): plots the trajectories in the space of their two principal components, clustered as in "PCA_traj".
  • plot_PCA_clusters_traj(n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)): plots the trajectories grouped by PCA clusters.

longitudinal_analysis( patient, data_folder, sequence_label='N. Seq. CDR3', clone_count_label='Clone count', replicate1_label='_F1', replicate2_label='_F2', separator='\t')
    def __init__(self, patient, data_folder, sequence_label='N. Seq. CDR3', clone_count_label='Clone count',
                 replicate1_label='_F1', replicate2_label='_F2', separator='\t'):
        """
        Import all the clonotypes of a given patient and store them in the dictionary "self.clones".
        It also creates the list of times "self.times". During this process the replicates at the
        same time points are merged together.
        The names of the tables containing TCR should be structured as "patient_time_replicate.csv".
        Those tables should be csv files compressed in a zip archive (see the example notebook).

        Parameters
        ----------
        patient : str
            The ID of the patient
        data_folder : str
            folder name containing the csv files listing the T-cell receptors
        sequence_label : str
            label in the clonotype tables indicating the sequence of the receptor
        clone_count_label : str
            label in the clonotype tables indicating the clonotype count
        replicate1_label, replicate2_label : str
            labels of the two replicates, passed to "_merge_replicates"
        separator : str
            separator symbol in the csv tables
        """

        self.clone_count_label = clone_count_label
        self.seq_label = sequence_label
        self.unique_clones = None
        self.time_occurrence = None
        self.times = []
        clones_repl = dict()

        # Iteration over all the files in the folder for importing each table
        for file_name in os.listdir(data_folder):
            # If the name before the underscore corresponds to the chosen patient..
            if file_name.split('_')[0] == patient:
                # Import the table, using the separator chosen by the caller
                frame = pd.read_csv(os.path.join(data_folder, file_name), sep=separator,
                                    compression=dict(method='zip'))
                # Store it in a dictionary where the key contains the patient, the time
                # and the replicate.
                clones_repl[file_name[:-10]] = frame
                # Reading the time from the name and storing it
                self.times.append(int(file_name.split('_')[1]))
                print('Clonotypes', file_name[:-10], 'imported')

        # Sorting the unique times
        self.times = np.sort(list(set(self.times)))
        self.clones = self._merge_replicates(patient, clones_repl, replicate1_label, replicate2_label)

Import all the clonotypes of a given patient and store them in the dictionary "self.clones". It also creates the list of times "self.times". During this process the replicates at the same time points are merged together. The names of the tables containing TCR should be structured as "patient_time_replicate.csv". Those tables should be csv files compressed in a zip archive (see the example notebook).

Parameters
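  • patient (str): The ID of the patient
  • data_folder (str): folder name containing the csv files listing the T-cell receptors
  • sequence_label (str): label in the clonotype tables indicating the sequence of the receptor
  • clone_count_label (str): label in the clonotype tables indicating the clonotype count
  • replicate1_label, replicate2_label (str): labels of the two replicates, passed to "_merge_replicates"
  • separator (str): separator symbol in the csv tables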
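
The naming convention described above can be illustrated with a minimal sketch. It assumes hypothetical file names of the form "patient_time_replicate" followed by an extension, and mirrors how the constructor reads the patient ID and the time from the parts separated by underscores. The helper `parse_table_name` and the file names are illustrative only and are not part of the package.

```python
def parse_table_name(file_name):
    """Return (patient, time) from a name shaped like 'patient_time_replicate.ext'."""
    parts = file_name.split('_')
    return parts[0], int(parts[1])


# Hypothetical file names, for illustration only
file_names = ['P1_0_F1.csv', 'P1_0_F2.csv', 'P1_15_F1.csv', 'P2_0_F1.csv']

# Unique, sorted time points of patient 'P1' (replicates share a time point)
times = sorted({parse_table_name(name)[1]
                for name in file_names
                if parse_table_name(name)[0] == 'P1'})
print(times)
```

Since the time is converted with int, the time field of the file name has to be an integer in this scheme.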
def compute_clone_time_occurrence(self):
    def compute_clone_time_occurrence(self):

        """
        It creates two new attributes: the list of unique clonotypes in the whole dataset
        "self.unique_clones" and the time occurrence of each of them "self.time_occurrence".
        The time occurrence is the number of time points in which the clone appears.
        """

        all_clones = np.array([])
        for id_, cl in self.clones.items():
            all_clones = np.append(all_clones, cl[self.seq_label].values)

        # np.unique returns the list of unique clonotypes and the number of
        # repetitions of each of them. As long as a sequence appears at most once
        # in each merged table, the number of repetitions is the time occurrence.
        self.unique_clones, self.time_occurrence = np.unique(all_clones, return_counts=True)

It creates two new attributes: the list of unique clonotypes in the whole dataset "self.unique_clones" and the time occurrence of each of them "self.time_occurrence". The time occurrence is the number of time points in which the clone appears.
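
The counting step can be sketched on toy data. The example below is only an illustration with made-up sequences; it assumes that each sequence appears at most once per time point, so that the number of repetitions returned by np.unique equals the time occurrence.

```python
import numpy as np

# Hypothetical merged tables: one list of unique sequences per time point
clones = {
    'P1_0': ['CASSA', 'CASSB'],
    'P1_15': ['CASSA', 'CASSC'],
    'P1_30': ['CASSA', 'CASSB'],
}

# Pool the sequences of all the time points, then count the repetitions
all_clones = np.concatenate([np.asarray(seqs) for seqs in clones.values()])
unique_clones, time_occurrence = np.unique(all_clones, return_counts=True)

for seq, occ in zip(unique_clones, time_occurrence):
    print(seq, occ)
```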

def plot_hist_persistence(self, figsize=(12, 10)):
    def plot_hist_persistence(self, figsize=(12,10)):

        """
        It plots the distribution of the time occurrence of the unique clonotypes.

        Parameters
        ----------
        figsize : tuple
            width, height in inches

        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes.Axes
            axes where the plot is drawn
        """

        # The time occurrences are computed on first use
        if not isinstance(self.unique_clones, np.ndarray):
            self.compute_clone_time_occurrence()

        fig, ax = plt.subplots(figsize=figsize)

        plt.rc('xtick', labelsize = 30)
        plt.rc('ytick', labelsize = 30)

        ax.set_yscale('log')
        ax.set_xlabel('Time occurrence', fontsize = 30)
        ax.set_ylabel('Counts', fontsize = 30)
        # One bin centred on each integer from 1 to the number of time points
        ax.hist(self.time_occurrence, bins=np.arange(1,len(self.times)+2)-0.5, rwidth=0.6)

        return fig, ax

It plots the distribution of time occurrence of the unique clonotypes

Parameters
  • figsize (tuple): width, height in inches
Returns
  • fig (matplotlib.figure.Figure): matplotlib figure
  • ax (matplotlib.axes.Axes): axes where the plot is drawn
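
The bin edges passed to the histogram can be checked in isolation. The sketch below uses made-up occurrences and numpy.histogram as a stand-in for the plotting call, which is an assumption made to keep the example free of figures; it shows that the edges place one bin around each integer between 1 and the number of time points.

```python
import numpy as np

n_times = 3                                   # hypothetical number of time points
time_occurrence = np.array([3, 2, 1, 1, 3])   # hypothetical time occurrences

# Edges 0.5, 1.5, ..., n_times + 0.5: one bin centred on each integer 1..n_times
edges = np.arange(1, n_times + 2) - 0.5
counts, _ = np.histogram(time_occurrence, bins=edges)
print(edges, counts)
```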
def top_clones_set(self, n_top_clones):
    def top_clones_set(self, n_top_clones):

        """
        Compute the set of top clones as the union of the "n_top_clones" most abundant
        clonotypes at each time point.

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point

        Returns
        -------
        top_clones : set of str
            set of top clones
        """

        top_clones = set()
        for id_, cl in self.clones.items():
            # Most abundant clonotypes of this time point, by decreasing count
            top_clones_at_time = cl.sort_values(self.clone_count_label, ascending=False)[:n_top_clones]
            top_clones = top_clones.union(top_clones_at_time[self.seq_label].values)
        return top_clones

Compute the set of top clones as the union of the "n_top_clones" most abundant clonotypes at each time point.

Parameters
  • n_top_clones (int): number of most abundant clonotypes at each time point
Returns
  • top_clones (set of str): set of top clones
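
The union over time points can be sketched without pandas. The function `top_clones_union` and the counts below are hypothetical and only illustrate the selection rule: the clones of each time point are ranked by count, the first n are kept, and the selections are merged.

```python
def top_clones_union(tables, n_top_clones):
    """tables maps a time point to {sequence: count}; returns the union of the top clones."""
    top = set()
    for counts in tables.values():
        # Sequences of this time point, by decreasing count
        ranked = sorted(counts, key=counts.get, reverse=True)
        top.update(ranked[:n_top_clones])
    return top


# Hypothetical clone counts at two time points
tables = {
    'P1_0': {'CASSA': 50, 'CASSB': 30, 'CASSC': 5},
    'P1_15': {'CASSA': 10, 'CASSC': 40, 'CASSD': 20},
}
print(sorted(top_clones_union(tables, 1)))
print(sorted(top_clones_union(tables, 2)))
```

The resulting set can therefore contain more than "n_top_clones" elements, up to that number times the number of time points.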
def build_traj_frame(self, clone_set):
    def build_traj_frame(self, clone_set):

        """
        This builds a dataframe containing the frequency at all the time points for each
        of the clonotypes specified in clone_set.
        The dataframe also has a field that contains the cumulative frequency.

        Parameters
        ----------
        clone_set : iterable of str
            clonotypes whose temporal trajectory is built

        Returns
        -------
        traj_frame : pandas.DataFrame
            dataframe containing the frequency at all the time points
        """

        clone_set = set(clone_set)
        traj_frame = pd.DataFrame(index=list(clone_set))
        traj_frame['Clone cumul freq'] = 0.0

        for id_, cl in self.clones.items():

            # Getting the time from the key of the clones dictionary
            t = id_.split('_')[1]
            # Selecting the clonotypes that are both in the frame at the given time
            # point and in clone_set
            top_clones_at_time = clone_set.intersection(set(cl[self.seq_label]))
            # Creating a sub-dataframe containing only the clones in top_clones_at_time
            # (.loc needs a list: recent pandas versions reject a set as indexer)
            clones_at_time = cl.set_index(self.seq_label).loc[list(top_clones_at_time)]
            # Creating a new column in the trajectory frame for the frequencies at that time
            traj_frame['t'+str(t)] = traj_frame.index.map(clones_at_time['Clone freq'].to_dict())
            # The clonotypes not present at that time are NaN. Below we convert NaN in 0s
            traj_frame = traj_frame.fillna(0)
            # The cumulative frequency of each clonotype is updated
            traj_frame['Clone cumul freq'] += traj_frame['t'+str(t)]

        return traj_frame

This builds a dataframe containing the frequency at all the time points for each of the clonotypes specified in clone_set. The dataframe has also a field that contains the cumulative frequency.

Parameters
  • clone_set (iterable of str): clonotypes whose temporal trajectory is built
Returns
  • traj_frame (pandas.DataFrame): dataframe containing the frequency at all the time points
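
The layout of the resulting table can be sketched with plain dictionaries. This is a dependency-free illustration under assumed names (`build_trajectories`, the 'cumul' key and the toy frequencies are hypothetical), not the pandas code of the package: a clone absent from a time point gets frequency 0 there, and the cumulative frequency is the sum over the time points.

```python
def build_trajectories(freq_tables, clone_set):
    """freq_tables maps time -> {sequence: frequency}; returns sequence -> row dict."""
    rows = {}
    for seq in clone_set:
        row = {'cumul': 0.0}
        for t, freqs in freq_tables.items():
            # A clone missing at this time point has frequency 0
            row['t' + str(t)] = freqs.get(seq, 0.0)
            row['cumul'] += row['t' + str(t)]
        rows[seq] = row
    return rows


# Hypothetical frequencies at times 0 and 15
freq_tables = {0: {'CASSA': 0.5, 'CASSB': 0.25}, 15: {'CASSA': 0.125}}
rows = build_trajectories(freq_tables, {'CASSA', 'CASSB'})
print(rows['CASSA'])
print(rows['CASSB'])
```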
def plot_trajectories(self, n_top_clones, colormap='viridis', figsize=(12, 10)):
    def plot_trajectories(self, n_top_clones, colormap='viridis', figsize=(12,10)):

        """
        Function to plot the trajectories of the top clones, i.e. the union of the
        "n_top_clones" most abundant clonotypes at each time point. Colors of the
        trajectories represent the cumulative frequency over all the time points.

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point
        colormap : str
            colormap of the trajectories
        figsize : tuple
            width, height in inches

        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes.Axes
            axes where the plot is drawn
        """

        # plt.get_cmap is used because cm.get_cmap is no longer available in recent matplotlib
        cmap = plt.get_cmap(colormap)
        top_clones = self.top_clones_set(n_top_clones)
        traj_frame = self.build_traj_frame(top_clones)

        fig, ax = plt.subplots(figsize=figsize)
        plt.rc('xtick', labelsize = 30)
        plt.rc('ytick', labelsize = 30)
        ax.set_yscale('log')
        ax.set_xlabel('time', fontsize = 25)
        ax.set_ylabel('frequency', fontsize = 25)

        log_counts = np.log10(traj_frame['Clone cumul freq'].values)
        max_log_count = max(log_counts)
        min_log_count = min(log_counts)

        for id_, row in traj_frame.iterrows():
            traj = row.drop(['Clone cumul freq']).to_numpy()
            log_count = np.log10(row['Clone cumul freq'])
            # Position of the clone between the smallest and largest log cumulative frequency
            norm_log_count = (log_count-min_log_count)/(max_log_count-min_log_count)
            ax.plot(self.times, traj, color=cmap(norm_log_count))

        sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=min_log_count, vmax=max_log_count))
        # The target axes are passed explicitly, since the mappable is not attached to any axes
        cb = fig.colorbar(sm, ax=ax)
        cb.set_label('Log10 cumulative frequency', fontsize = 25)

        return fig, ax

Function to plot the trajectories of the top clones, i.e. the union of the "n_top_clones" most abundant clonotypes at each time point. Colors of the trajectories represent the cumulative frequency over all the time points.

Parameters
  • n_top_clones (int): number of most abundant clonotypes at each time point
  • colormap (str): colormap of the trajectories
  • figsize (tuple): width, height in inches
Returns
  • fig (matplotlib.figure.Figure): matplotlib figure
  • ax (matplotlib.axes.Axes): axes where the plot is drawn
def PCA_traj(self, n_top_clones, nclus=4):
    def PCA_traj(self, n_top_clones, nclus=4):

        """
        Perform PCA over the normalized trajectories of the top TCR clones.
        The normalization consists in dividing the whole trajectory by its maximum value.
        After PCA the trajectories are clustered in the space of the two principal components
        with a hierarchical clustering algorithm.

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point to consider in the PCA
        nclus : int
            number of clusters

        Returns
        -------
        pca : sklearn.decomposition._pca.PCA
            object containing the result of the principal component analysis
        clustering : sklearn.cluster._agglomerative.AgglomerativeClustering
            object containing the result of the hierarchical clustering
        """

        # Getting the top n_top_clones clonotypes at each time point
        top_clones = self.top_clones_set(n_top_clones)
        # Building a trajectory dataframe
        traj_frame = self.build_traj_frame(top_clones)

        # Converting it to a numpy matrix (one row per clone, one column per time point)
        traj_matrix = traj_frame.drop(['Clone cumul freq'], axis = 1).to_numpy()

        # Normalize each trajectory by its maximum
        norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis]

        # The PCA is fitted on the transposed matrix, so that each of the two
        # components has one entry per clone
        pca = PCA(n_components = 2).fit(norm_traj_matrix.T)
        clustering = AgglomerativeClustering(n_clusters = nclus)
        clustering = clustering.fit(pca.components_.T)

        return pca, clustering

Perform PCA over the normalized trajectories of the top TCR clones. The normalization consists in dividing the whole trajectory by its maximum value. After PCA the trajectories are clustered in the space of the two principal components with a hierarchical clustering algorithm.

Parameters
  • n_top_clones (int): number of most abundant clonotypes at each time point to consider in the PCA
  • nclus (int): number of clusters
Returns
  • pca (sklearn.decomposition._pca.PCA): object containing the result of the principal component analysis
  • clustering (sklearn.cluster._agglomerative.AgglomerativeClustering): object containing the result of the hierarchical clustering
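
The normalization and the shape of the PCA output can be sketched with NumPy alone. The example below is an illustration on made-up frequencies and is not the code of the package: it replaces scikit-learn's PCA with an explicit centering followed by a singular value decomposition, which yields the same components up to the sign of each component, and it leaves out the clustering step.

```python
import numpy as np

# Hypothetical trajectories: one row per clone, one column per time point
traj_matrix = np.array([
    [0.01, 0.02, 0.04, 0.08],   # expanding clone
    [0.08, 0.04, 0.02, 0.01],   # contracting clone
    [0.02, 0.04, 0.08, 0.16],   # expanding clone, twice as abundant
    [0.05, 0.05, 0.04, 0.05],   # roughly constant clone
])

# Normalize each trajectory by its maximum, so that only its shape matters
norm_traj = traj_matrix / traj_matrix.max(axis=1, keepdims=True)

# PCA on the transposed matrix: time points are samples, clones are features
X = norm_traj.T
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:2]                                   # shape (2, n_clones)
explained_variance_ratio = (S**2 / np.sum(S**2))[:2]
coords = components.T                                 # one 2D point per clone

print(components.shape)
print(explained_variance_ratio)
```

In this toy case the first and third clones have the same normalized trajectory, so they are mapped to the same point of the plane and any clustering of these points would group them together.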
def plot_PCA2(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12, 10)):
    def plot_PCA2(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):

        """
        Plotting the trajectories in the space of their two principal components and
        clustering them as in "PCA_traj".

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point to consider in the PCA
        nclus : int
            number of clusters
        colormap : str
            colormap indicating the different clusters
        figsize : tuple
            width, height in inches

        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        ax : matplotlib.axes.Axes
            axes where the plot is drawn
        """

        # plt.get_cmap is used because cm.get_cmap is no longer available in recent matplotlib
        cmap = plt.get_cmap(colormap)
        pca, clustering = self.PCA_traj(n_top_clones, nclus)

        fig, ax = plt.subplots(figsize=figsize)
        # n_features_in_ is the number of trajectories; it replaces the n_features_
        # attribute, which recent scikit-learn releases no longer provide
        ax.set_title('PCA components (%i trajs)' %pca.n_features_in_, fontsize = 25)
        ax.set_xlabel('First component (expl var: %3.2f)'%pca.explained_variance_ratio_[0], fontsize = 25)
        ax.set_ylabel('Second component (expl var: %3.2f)'%pca.explained_variance_ratio_[1], fontsize = 25)
        for c_ind in range(clustering.n_clusters):
            # Coordinates of the trajectories assigned to this cluster
            x = pca.components_[0][clustering.labels_ == c_ind]
            y = pca.components_[1][clustering.labels_ == c_ind]
            ax.scatter(x, y, alpha=0.2, color=cmap(c_ind/clustering.n_clusters))

        return fig, ax

Plotting the trajectories in the space of their two principal components and clustering them as in "PCA_traj".

Parameters
  • n_top_clones (int): number of most abundant clonotypes at each time point to consider in the PCA
  • nclus (int): number of clusters
  • colormap (str): colormap indicating the different clusters
  • figsize (tuple): width, height in inches
Returns
  • fig (matplotlib.figure.Figure): matplotlib figure
  • ax (matplotlib.axes.Axes): axes where the plot is drawn
def plot_PCA_clusters_traj(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12, 10)):
    def plot_PCA_clusters_traj(self, n_top_clones, nclus=4, colormap='viridis', figsize=(12,10)):

        """
        Plotting the trajectories grouped by PCA clusters.

        Parameters
        ----------
        n_top_clones : int
            number of most abundant clonotypes at each time point to consider in the PCA
        nclus : int
            number of clusters
        colormap : str
            colormap indicating the different clusters
        figsize : tuple
            width, height in inches (currently not used: the figure size is set to (5*nclus, 12))

        Returns
        -------
        fig : matplotlib.figure.Figure
            matplotlib figure
        axs : numpy.ndarray of matplotlib.axes.Axes, shape (2, nclus)
            axes where the plots are drawn
        """

        # plt.get_cmap is used because cm.get_cmap is no longer available in recent matplotlib
        cmap = plt.get_cmap(colormap)
        pca, clustering = self.PCA_traj(n_top_clones, nclus)

        n_cl = clustering.n_clusters

        # Getting the top n_top_clones clonotypes at each time point
        top_clones = self.top_clones_set(n_top_clones)
        # Building a trajectory dataframe
        traj_frame = self.build_traj_frame(top_clones)

        # Converting it to a numpy matrix
        traj_matrix = traj_frame.drop(['Clone cumul freq'], axis=1).to_numpy()

        # Normalize each trajectory by its maximum
        norm_traj_matrix = traj_matrix/np.max(traj_matrix, axis=1)[:, np.newaxis]

        # squeeze=False keeps axs two-dimensional even with a single cluster
        fig, axs = plt.subplots(2, n_cl, figsize=(5*n_cl, 12), squeeze=False)
        for cl in range(n_cl):
            trajs = norm_traj_matrix[clustering.labels_ == cl]
            axs[0][cl].set_xlabel('Time', fontsize = 15)
            axs[0][cl].set_ylabel('Normalized frequency', fontsize = 15)
            axs[1][cl].set_xlabel('Time', fontsize = 15)
            axs[1][cl].set_ylabel('Normalized frequency', fontsize = 15)
            # Top row: all the trajectories of the cluster
            for traj in trajs:
                axs[0][cl].plot(self.times, traj, alpha=0.2, color=cmap(cl/n_cl))
            # Bottom row: mean trajectory of the cluster with its standard deviation
            axs[1][cl].set_ylim(0,1)
            axs[1][cl].errorbar(self.times, np.mean(trajs, axis=0),
                                yerr=np.std(trajs, axis=0), lw=3, color=cmap(cl/n_cl))

        plt.tight_layout()
        return fig, axs

Plotting the trajectories grouped by PCA clusters. The top row shows all the trajectories of each cluster, the bottom row their mean with the standard deviation as error bars.

Parameters
  • n_top_clones (int): number of most abundant clonotypes at each time point to consider in the PCA
  • nclus (int): number of clusters
  • colormap (str): colormap indicating the different clusters
  • figsize (tuple): width, height in inches (currently not used: the figure size is set to (5*nclus, 12))
Returns
  • fig (matplotlib.figure.Figure): matplotlib figure
  • axs (array of matplotlib.axes.Axes): axes where the plots are drawn
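
The per-cluster summary drawn with error bars can be sketched with NumPy. The normalized trajectories and the cluster labels below are made up, and the dictionary `summary` is a hypothetical container used only for this illustration.

```python
import numpy as np

# Hypothetical normalized trajectories (rows) and cluster label of each row
norm_traj = np.array([
    [0.25, 0.5, 1.0],
    [0.50, 0.5, 1.0],
    [1.00, 0.5, 0.25],
])
labels = np.array([0, 0, 1])

# Mean trajectory and standard deviation at each time point, cluster by cluster
summary = {}
for cl in np.unique(labels):
    trajs = norm_traj[labels == cl]
    summary[int(cl)] = (trajs.mean(axis=0), trajs.std(axis=0))

for cl, (mean, std) in summary.items():
    print(cl, mean, std)
```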
class Data_Process:
class Data_Process():

    """
    A class used to represent longitudinal RepSeq data and pre-analysis of the longitudinal data associated with
    one individual.

    Attributes
    ----------
    path : str
        path to the data files used for the analysis
    filename1 : str
        the name of the file of the RepSeq sample which can be the first replicate when deciphering the experimental noise
        or the first time point RepSeq sample when analysing responding clones to a stimulus between two time points.
    filename2 : str
        the name of the file of the RepSeq sample which can be the second replicate when deciphering the experimental noise
        or the second time point RepSeq sample when analysing responding clones to a stimulus between two time points.
    colnames1 : list of str
        column names of the first sample, in the order: clone fraction, clone count,
        nucleotide CDR3, amino-acid CDR3
    colnames2 : list of str
        column names of the second sample, in the same order

    Methods
    -------
    import_data() :
        imports and merges two RepSeq samples and builds a single data frame with the frequencies and abundances
        of all TCR clones present in the union of both samples.
    """

    def __init__(self, path, filename1, filename2, colnames1, colnames2):

        self.path = path
        self.filename1 = filename1
        self.filename2 = filename2
        self.colnames1 = colnames1
        self.colnames2 = colnames2


    def import_data(self):
        """
        Imports and merges two RepSeq samples and builds a single data frame with the frequencies
        and abundances of all TCR clones present in the union of both samples.

        Returns
        -------
        number_clones : int
            number of clones in the union of the two RepSeq samples (counted before filtering)
        df : pandas.DataFrame
            data frame containing the information labeled by the column names for both RepSeq samples
        """

        mincount = 0
        maxcount = np.inf

        headerline = 0  # line number of the header line
        newnames = ['Clone_fraction', 'Clone_count', 'ntCDR3', 'AACDR3']

        def read_sample(filename, colnames):
            # Files whose name ends in 'gz' are read as gzip archives
            compression = 'gzip' if filename.endswith('gz') else 'infer'
            frame = pd.read_csv(self.path + filename, delimiter='\t', usecols=colnames,
                                header=headerline, compression=compression)[colnames]
            frame.columns = newnames
            return frame

        F1Frame_chunk = read_sample(self.filename1, self.colnames1)
        F2Frame_chunk = read_sample(self.filename2, self.colnames2)

        suffixes = ('_1', '_2')
        # Outer merge on the nucleotide sequence: clones of both samples are kept
        mergedFrame = pd.merge(F1Frame_chunk, F2Frame_chunk, on=newnames[2], suffixes=suffixes, how='outer')
        # Clones missing from one sample get zero fraction and zero count in that sample
        for name in newnames[:2]:
            for suffix in suffixes:
                mergedFrame[name + suffix] = mergedFrame[name + suffix].fillna(0)
        # Counts are stored as integers (the original cast was not assigned back)
        for suffix in suffixes:
            mergedFrame[newnames[1] + suffix] = mergedFrame[newnames[1] + suffix].astype(int)

        # Keep a single amino-acid column: the value of the first sample,
        # or the one of the second sample when the first is missing
        aa1, aa2 = newnames[3] + suffixes[0], newnames[3] + suffixes[1]
        mergedFrame[newnames[3]] = mergedFrame[aa1].fillna(mergedFrame[aa2])
        mergedFrame = mergedFrame.drop(columns=[aa1, aa2])
        mergedFrame = mergedFrame[[name + suffix for name in newnames[:2] for suffix in suffixes]
                                  + [newnames[2], newnames[3]]]
        # This filter has an effect only if mincount > 0
        filterout = (((mergedFrame.Clone_count_1 < mincount) & (mergedFrame.Clone_count_2 == 0))
                     | ((mergedFrame.Clone_count_2 < mincount) & (mergedFrame.Clone_count_1 == 0)))
        number_clones = len(mergedFrame)
        return number_clones, mergedFrame.loc[((mergedFrame.Clone_count_1 <= maxcount)
                                               & (mergedFrame.Clone_count_2 <= maxcount)) & ~filterout]

A class used to represent longitudinal RepSeq data and pre-analysis of the longitudinal data associated with one individual. ...

Attributes
  • path (str): path to the directory containing the data files used for the analysis
  • filename1 (str): name of the file of the first RepSeq sample. This is the first replicate when learning the experimental noise, or the first time point when detecting clones responding to a stimulus between two time points.
  • filename2 (str): name of the file of the second RepSeq sample. This is the second replicate when learning the experimental noise, or the second time point when detecting clones responding to a stimulus between two time points.
  • colnames1 (list of str): column names to read from the first sample, given in the order clone fraction, clone count, nucleotide CDR3, amino-acid CDR3
  • colnames2 (list of str): column names to read from the second sample, given in the same order
Methods

import_data() : imports and merges two RepSeq samples into a single data-frame with the frequencies and abundances of all TCR clones present in the union of both samples.

Data_Process(path, filename1, filename2, colnames1, colnames2)
    def __init__(self, path, filename1, filename2, colnames1,  colnames2):

        self.path = path
        self.filename1 = filename1
        self.filename2 = filename2
        self.colnames1 = colnames1
        self.colnames2 = colnames2
def import_data(self):
    def import_data(self):
        """
        imports and merges two RepSeq samples into a single data-frame with the frequencies and abundances of all TCR clones present in the union of both samples.

        Parameters
        ----------
        NONE
        Returns
        -------
        number_clones
            int, number of clones in the data frame which is the union of the two RepSeq samples used as inputs
        df
            pandas data-frame containing the information labeled in the colnames lists
            for both RepSeq samples taken as input.
        """

        mincount = 0
        maxcount = np.inf

        headerline=0 #line number of headerline
        newnames=['Clone_fraction','Clone_count','ntCDR3','AACDR3']

        if self.filename1[-2:] == 'gz':
            F1Frame_chunk=pd.read_csv(self.path + self.filename1, delimiter='\t',usecols=self.colnames1,header=headerline, compression = 'gzip')[self.colnames1]
        else:
            F1Frame_chunk=pd.read_csv(self.path + self.filename1, delimiter='\t',usecols=self.colnames1,header=headerline)[self.colnames1]

        if self.filename2[-2:] == 'gz':
            F2Frame_chunk=pd.read_csv(self.path + self.filename2, delimiter='\t',usecols=self.colnames2,header=headerline, compression = 'gzip')[self.colnames2]
        else:
            F2Frame_chunk=pd.read_csv(self.path + self.filename2, delimiter='\t',usecols=self.colnames2,header=headerline)[self.colnames2]

        F1Frame_chunk.columns=newnames
        F2Frame_chunk.columns=newnames
        suffixes=('_1','_2')
        mergedFrame=pd.merge(F1Frame_chunk,F2Frame_chunk,on=newnames[2],suffixes=suffixes,how='outer')
        for nameit in [0,1]:
            for labelit in suffixes:
                col=newnames[nameit]+labelit
                mergedFrame[col]=mergedFrame[col].fillna(0) #clones absent from one sample get a zero fraction/count
                if nameit==1:
                    mergedFrame[col]=mergedFrame[col].astype(int) #assign the result, astype is not in-place
        def dummy(x):
            val=x.iloc[0]
            if pd.isnull(val):
                val=x.iloc[1]
            return val
        mergedFrame.loc[:,newnames[3]+suffixes[0]]=mergedFrame.loc[:,[newnames[3]+suffixes[0],newnames[3]+suffixes[1]]].apply(dummy,axis=1) #assigns AA sequence to clones, creates duplicates
        mergedFrame.drop(columns=newnames[3]+suffixes[1],inplace=True) #removes duplicates
        mergedFrame.rename(columns = {newnames[3]+suffixes[0]:newnames[3]}, inplace = True)
        mergedFrame=mergedFrame[[newname+suffix for newname in newnames[:2] for suffix in suffixes]+[newnames[2],newnames[3]]]
        filterout=((mergedFrame.Clone_count_1<mincount) & (mergedFrame.Clone_count_2==0)) | ((mergedFrame.Clone_count_2<mincount) & (mergedFrame.Clone_count_1==0)) #has effect only if mincount>0
        number_clones=len(mergedFrame)
        return number_clones,mergedFrame.loc[((mergedFrame.Clone_count_1<=maxcount) & (mergedFrame.Clone_count_2<=maxcount)) & ~filterout]

Imports and merges two RepSeq samples into a single data-frame with the frequencies and abundances of all TCR clones present in the union of both samples.

Parameters
  • None
Returns
  • number_clones: int, number of clones in the data-frame which is the union of the two RepSeq samples used as inputs
  • df: pandas data-frame containing the information labeled in the colnames lists for both RepSeq samples taken as input.
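
The union-of-clones merge described above can be pictured with a small plain-Python sketch. It is an illustration only, not the package's pandas implementation: `merge_samples` and the toy dictionaries are hypothetical, and each sample is assumed to map a nucleotide CDR3 sequence to its read count.

```python
def merge_samples(sample1, sample2):
    """Outer-join two {ntCDR3: read_count} dicts (hypothetical helper).

    Clones missing from one sample receive a count of 0, mirroring the
    zero fill applied after the outer merge in import_data.
    """
    merged = {}
    for seq in sorted(set(sample1) | set(sample2)):
        merged[seq] = (sample1.get(seq, 0), sample2.get(seq, 0))
    return merged


# Toy data (made up for the example)
first = {"TGTGCC": 12, "TGTAGT": 3}
second = {"TGTGCC": 9, "TGTTTT": 1}

merged = merge_samples(first, second)
number_clones = len(merged)  # size of the union of both samples
```

Keeping the zero-count entries matters because the noise model is fitted on pairs of counts, including clones seen in only one of the two samples.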
class Noise_Model:
class Noise_Model():

    """
    A class used to build an object associated to methods in order to learn the experimental noise from same day
    biological RepSeq samples.
    ...
    Methods
    -------
    get_sparserep(df) :
        get sparse representation of the abundances / frequencies of the TCR clones present in both RepSeq samples of interest.
        this changes the data input to speed up the algorithm
    learn_null_model(df, noise_model, init_paras,  output_dir = None, filename = None, display_loss_function = False) :
        function to optimize the likelihood associated to the experimental noise model and get the associated parameters.
    diversity_estimate(df, paras, noise_model) :
        function to get the estimation of diversity from the noise model information.
    """


    def get_sparserep(self, df):
        """
        Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
        unicountvals_1(2) are the unique values of n1(2).
        sparse_rep_counts gives the counts of unique pairs.
        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
        len(indn1)=len(indn2)=len(sparse_rep_counts)
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            this data-frame should give the list of TCR clones present in two replicate RepSeq samples
            associated to their clone frequencies and clone abundances in the first and second replicate.
        Returns
        -------
        indn1
            numpy array, for each unique pair, index of n1 in unicountvals_1
        indn2
            numpy array, for each unique pair, index of n2 in unicountvals_2
        sparse_rep_counts
            numpy array, # of clones having the read counts pair {(n1,n2)}
        unicountvals_1
            numpy array of unique count values present in the first sample in df['Clone_count_1']
        unicountvals_2
            numpy array of unique count values present in the second sample in df['Clone_count_2']
        Nreads1
            total number of counts/reads in the first sample referred in df by "_1"
        Nreads2
            total number of counts/reads in the second sample referred in df by "_2"
        """

        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']].copy()
        counts['paircount'] = 1  # gives a weight of 1 to each observed clone

        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
        clonecountpair_vals = clone_counts.index.values
        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
        NreadsI = np.sum(counts['Clone_count_1'])
        NreadsII = np.sum(counts['Clone_count_2'])

        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)

        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII


    def _NegBinPar(self,m,v,mvec):
        '''
        Same as NegBinParMtr, but for m and v being scalars.
        Assumes m>0.
        Output is (len(mvec),) array
        '''
        mmax=mvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(mmax+1,dtype=float)
        NBvec[1:]=np.log((NBvec[1:]+r-1)/NBvec[1:]*p) #vectorization won't help unfortunately here since log needs to be over array
        NBvec[0]=r*math.log(m/v)
        NBvec=np.exp(np.cumsum(NBvec)[mvec]) #save a bit here
        return NBvec

    def _NegBinParMtr(self,m,v,nvec): #speed up only insofar as the log and exp are called once on array instead of multiple times on rows
        '''
        computes NegBin probabilities over the ordered (but possibly discontiguous) vector (nvec)
        for mean/variance combinations given by the mean (m) and variance (v) vectors.
        Note that m<v for negative binomial.
        Output is (len(m),len(nvec)) array
        '''
        nmax=nvec[-1]
        p = 1-m/v
        r = m*m/v/p
        NBvec=np.arange(nmax+1,dtype=float)
        NBvec=np.log((NBvec+r[:,np.newaxis]-1)*(p[:,np.newaxis]/NBvec))
        NBvec[:,0]=r*np.log(m/v) #handle NBvec[0]=0, treated specially when m[0]=0, see below
        NBvec=np.exp(np.cumsum(NBvec,axis=1)) #save a bit here
        if m[0]==0:
            NBvec[0,:]=0.
            NBvec[0,0]=1.
        NBvec=NBvec[:,nvec]
        return NBvec

    def _PoisPar(self, Mvec,unicountvals):
        #assert Mvec[0]==0, "first element needs to be zero"
        nmax=unicountvals[-1]
        nlen=len(unicountvals)
        mlen=len(Mvec)
        Nvec=unicountvals
        logNvec=-np.insert(np.cumsum(np.log(np.arange(1,nmax+1))),0,0.)[unicountvals] #avoid n=0 nans
        Nmtr=np.exp(Nvec[np.newaxis,:]*np.log(Mvec)[:,np.newaxis]+logNvec[np.newaxis,:]-Mvec[:,np.newaxis]) # np.log(Mvec) throws warning: since log(0)=-inf
        if Mvec[0]==0:
            Nmtr[0,:]=np.zeros((nlen,)) #when m=0, n=0, and so get rid of nans from log(0)
            Nmtr[0,0]=1. #handled below
        if unicountvals[0]==0: #if n=0 included get rid of nans from log(0)
            Nmtr[:,0]=np.exp(-Mvec)
        return Nmtr

    def _get_rhof(self,alpha_rho, nfbins,fmin,freq_dtype):
        '''
        generates power law (power is alpha_rho) clone frequency distribution over
        nfbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
        Outputs log probabilities obtained at log frequencies'''
        fmax=1e0
        logfvec=np.linspace(np.log10(fmin),np.log10(fmax), nfbins)
        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()
        logrhovec=logfvec*alpha_rho
        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
        logrhovec-=normconst
        return logrhovec,logfvec, normconst


    def _get_logPn_f(self,unicounts,Nreads,logfvec, noise_model, paras):

        """
        tools to compute the likelihood of the noise model. It is not useful for the user.
        """

        # Choice of the model:

        if noise_model<1:

            m_total=float(np.power(10, paras[3]))
            r_c=Nreads/m_total
        if noise_model<2:

            beta_mv= paras[1]
            alpha_mv=paras[2]

        if noise_model<1: #for models that include cell counts
            #compute parametrized range (mean-sigma,mean+5*sigma) of m values (number of cells) conditioned on n values (reads) appearing in the data only
            nsigma=5.
            nmin=300.
            #for each n, get actual range of m to compute around n-dependent mean m
            m_low =np.zeros((len(unicounts),),dtype=int)
            m_high=np.zeros((len(unicounts),),dtype=int)
            for nit,n in enumerate(unicounts):
                mean_m=n/r_c
                dev=nsigma*np.sqrt(mean_m)
                m_low[nit] =int(mean_m-  dev) if (mean_m>dev**2) else 0
                m_high[nit]=int(mean_m+5*dev) if (      n>nmin) else int(10*nmin/r_c)
            m_cellmax=np.max(m_high)
            #across n, collect all in-range m
            mvec_bool=np.zeros((m_cellmax+1,),dtype=bool) #cheap bool
            nvec=range(len(unicounts))
            for nit in nvec:
                mvec_bool[m_low[nit]:m_high[nit]+1]=True  #mask vector
            mvec=np.arange(m_cellmax+1)[mvec_bool]
            #transform to in-range index
            for nit in nvec:
                m_low[nit]=np.where(m_low[nit]==mvec)[0][0]
                m_high[nit]=np.where(m_high[nit]==mvec)[0][0]

        Pn_f=np.zeros((len(logfvec),len(unicounts)))
        if noise_model==0:

            mean_m=m_total*np.exp(logfvec)
            var_m=mean_m+beta_mv*np.power(mean_m,alpha_mv)
            Poisvec = self._PoisPar(mvec*r_c,unicounts)
            for f_it in range(len(logfvec)):
                NBvec=self._NegBinPar(mean_m[f_it],var_m[f_it],mvec)
                for n_it,n in enumerate(unicounts):
                    Pn_f[f_it,n_it]=np.dot(NBvec[m_low[n_it]:m_high[n_it]+1],Poisvec[m_low[n_it]:m_high[n_it]+1,n_it])

        elif noise_model==1:

            mean_n=Nreads*np.exp(logfvec)
            var_n=mean_n+beta_mv*np.power(mean_n,alpha_mv)
            Pn_f = self._NegBinParMtr(mean_n,var_n,unicounts)
        elif noise_model==2:

            mean_n=Nreads*np.exp(logfvec)
            Pn_f= self._PoisPar(mean_n,unicounts)
        else:
            #fail early instead of returning log(0) for an unknown model
            raise ValueError('noise_model is 0, 1, or 2 only')

        return np.log(Pn_f)

    #-----------------------------Null-Model-optimization--------------------------

    def _get_Pn1n2(self, paras, sparse_rep, noise_model):

        """
        Tool to compute likelihood of the noise model. It is not useful for the user.
        """

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = sparse_rep

        nfbins = 1200
        freq_dtype = float

        # Parameters

        alpha = paras[0]
        fmin = np.power(10,paras[-1])

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)

        logfvec_tmp=deepcopy(logfvec)

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)

        # for the trapezoid integral methods

        dlogfby2=np.diff(logfvec)/2

        # Compute P(0,0) for the normalization constraint
        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])

        #print("computing P(n1,n2)")
        Pn1n2 = np.zeros(len(sparse_rep_counts))  # 1D representation
        for it, (ind1, ind2) in enumerate(zip(indn1, indn2)):
            integ = np.exp(logPn1_f[:, ind1] + logrhofvec + logPn2_f[:, ind2] + logfvec)
            Pn1n2[it] = np.dot(dlogfby2, integ[1:] + integ[:-1])
        Pn1n2 /= 1. - Pn0n0  # renormalize
        return -np.dot(sparse_rep_counts, np.where(Pn1n2 > 0, np.log(Pn1n2), 0)) / float(np.sum(sparse_rep_counts))


    def _callback(self, paras, nparas, sparse_rep, noise_model):
        '''prints iteration info. called by scipy.minimize. Not useful for the user.'''

        global curr_iter
        #curr_iter = 0
        global Loss_function
        print(''.join(['{0:d} ']+['{'+str(it)+':3.6f} ' for it in range(1,len(paras)+1)]).format(*([curr_iter]+list(paras))))
        #print ('{' + str(len(paras)+1) + ':3.6f}'.format( [self.get_Pn1n2(paras, sparse_rep, acq_model_type)]))
        Loss_function = self._get_Pn1n2(paras, sparse_rep, noise_model)
        print(Loss_function)
        curr_iter += 1


    # Constraints for the Null-Model, no filtered
    def _nullmodel_constr_fn(self, paras, sparse_rep, noise_model, constr_type):

        '''
        returns either or both of the two level-set functions: log<f>-log(1/N), with N=Nclones/(1-P(0,0)) and log(Z_f), with Z_f=N<f>_{n+n'=0} + sum_i^Nclones <f>_{f|n,n'}
        not useful for the user
        '''

        # Choice of the model:

        indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep

        #Variables that would be chosen in the future by the user
        nfbins = 1200
        freq_dtype = float

        alpha = paras[0]  # power law exponent
        fmin = np.power(10, paras[-1]) # true minimal frequency

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)
        dlogfby2 = np.diff(logfvec) / 2.  # 1/2 comes from trapezoid integration below

        integ = np.exp(logrhofvec + 2 * logfvec)
        avgf_ps = np.dot(dlogfby2, integ[:-1] + integ[1:])

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI, logfvec, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII, logfvec, noise_model, paras)

        integ = np.exp(logPn1_f[:, 0] + logPn2_f[:, 0] + logrhofvec + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])
        logPnng0 = np.log(1 - Pn0n0)
        avgf_null_pair = np.exp(logPnng0 - np.log(np.sum(sparse_rep_counts)))

        C1 = np.log(avgf_ps) - np.log(avgf_null_pair)

        integ = np.exp(logPn1_f[:, 0] + logPn2_f[:, 0] + logrhofvec + 2 * logfvec)
        log_avgf_n0n0 = np.log(np.dot(dlogfby2, integ[1:] + integ[:-1]))

        integ = np.exp(logPn1_f[:, indn1] + logPn2_f[:, indn2] + logrhofvec[:, np.newaxis] + logfvec[:, np.newaxis])
        log_Pn1n2 = np.log(np.sum(dlogfby2[:, np.newaxis] * (integ[1:, :] + integ[:-1, :]), axis=0))
        integ = np.exp(np.log(integ) + logfvec[:, np.newaxis])
        tmp = deepcopy(log_Pn1n2)
        tmp[tmp == -np.inf] = np.inf  # since subtracted in next line (np.Inf alias was removed in NumPy 2.0)
        avgf_n1n2 = np.exp(np.log(np.sum(dlogfby2[:, np.newaxis] * (integ[1:, :] + integ[:-1, :]), axis=0)) - tmp)
        log_sumavgf = np.log(np.dot(sparse_rep_counts, avgf_n1n2))

        logNclones = np.log(np.sum(sparse_rep_counts)) - logPnng0
        Z = np.exp(logNclones + np.log(Pn0n0) + log_avgf_n0n0) + np.exp(log_sumavgf)

        C2 = np.log(Z)

        # print('C1:'+str(C1)+' C2:'+str(C2))
        if constr_type == 0:
            return C1
        elif constr_type == 1:
            return C2
        else:
            return C1, C2


    # Null-Model optimization learning

    def learn_null_model(self, df, noise_model, init_paras,  output_dir = None, filename = None, display_loss_function = False):  # constraint type 1 gives only low error modes, see paper for details.
        """
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            this data-frame should give the list of TCR clones present in two replicate RepSeq samples
            associated to their clone frequencies and clone abundances in the first and second replicate.
        noise_model: int
            choice of noise model: 0 (negative binomial + Poisson), 1 (negative binomial) or 2 (Poisson)
        init_paras: numpy array
            initial vector of parameters to start the optimization of the model from data (df)
        output_dir : str
            default value is None, it is the output directory name in which we want to save the values of the parameters
        filename : str
            default value is None, it is the name (without extension) of the file in which the parameters are saved
        display_loss_function : bool
            boolean variable to choose if we want to print the loss function during the experimental noise learning, default value is
            False.

        Returns
        -------
        outstruct
            scipy.optimize.OptimizeResult, result of the optimization; the learned parameters are in outstruct.x
        constr_value
            float, value of the constraint

        """

        # Data introduction
        sparse_rep = self.get_sparserep(df)
        constr_type = 1

        # Choice of the model:
        # Parameters initialization depending on the model
        if noise_model < 1:
            parameter_labels = ['alph_rho', 'beta', 'alpha', 'm_total', 'fmin']
        elif noise_model == 1:
            parameter_labels = ['alph_rho', 'beta', 'alpha', 'fmin']
        else:
            parameter_labels = ['alph_rho', 'fmin']

        assert len(parameter_labels) == len(init_paras), "number of model and initial paras differ!"

        condict = {'type': 'eq', 'fun': self._nullmodel_constr_fn, 'args': (sparse_rep, noise_model, constr_type)}


        partialobjfunc = partial(self._get_Pn1n2, sparse_rep=sparse_rep, noise_model=noise_model)
        nullfunctol = 1e-6
        nullmaxiter = 200
        header = ['Iter'] + parameter_labels
        print(''.join(['{' + str(it) + ':9s} ' for it in range(len(init_paras) + 1)]).format(*header))

        global curr_iter
        curr_iter = 1
        callbackp = partial(self._callback, nparas=len(init_paras), sparse_rep = sparse_rep, noise_model= noise_model)
        outstruct = minimize(partialobjfunc, init_paras, method='SLSQP', callback=callbackp, constraints=condict,
                        options={'ftol': nullfunctol, 'disp': True, 'maxiter': nullmaxiter})

        constr_value = self._nullmodel_constr_fn(outstruct.x, sparse_rep, noise_model, constr_type)

        # parameter_labels was already chosen above according to noise_model
        d = {'label' : parameter_labels, 'value': outstruct.x}
        df = pd.DataFrame(data = d)

        # default file name and working directory when none is given
        if filename is None:
            filename = 'nullpara' + str(noise_model)

        if output_dir is None:
            df.to_csv(filename + '.txt', sep = '\t')
        else:
            df.to_csv(output_dir + '/' + filename + '.txt', sep = '\t')

        return outstruct, constr_value

    def diversity_estimate(self, df, paras, noise_model):

        """
        Estimate diversity of the individual repertoire from the experimental noise learning step.
        Parameters
        ----------
        df : data-frame
            The data-frame which has been used to learn the noise model
        paras : numpy array
            vector containing the noise parameters
        noise_model : int
            choice of noise model
        Returns
        -------
        diversity_estimate
            float, diversity estimate from the noise model inference.

        """

        sparse_rep = self.get_sparserep(df)

        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2, NreadsI, NreadsII = sparse_rep

        nfbins = 1200
        freq_dtype = float

        # Parameters

        alpha = paras[0]
        fmin = np.power(10,paras[-1])

        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)

        logfvec_tmp=deepcopy(logfvec)

        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)

        # for the trapezoid integral methods

        dlogfby2=np.diff(logfvec)/2

        # Compute P(0,0) for the normalization constraint
        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])

        #print(np.sum(sparse_rep_counts))
        N_obs = np.sum(sparse_rep_counts)

        return int(N_obs/(1-Pn0n0))
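
The mean/variance parametrization used by `_NegBinPar` in the listing above can be checked with a short standard-library sketch. It follows the same recursion (log P(0) = r·log(m/v), then log[P(n)/P(n-1)] = log((n+r-1)/n · p)), but `negbin_pmf` and the numerical values are hypothetical and chosen only for illustration; this is not the package's vectorized code.

```python
import math


def negbin_pmf(mean, var, nmax):
    """Negative binomial probabilities P(0), ..., P(nmax).

    Hypothetical helper: assumes var > mean > 0, as required for a
    negative binomial distribution.
    """
    p = 1.0 - mean / var
    r = mean * mean / var / p
    # First entry is log P(0); the others are log ratios P(n)/P(n-1).
    log_terms = [r * math.log(mean / var)]
    for n in range(1, nmax + 1):
        log_terms.append(math.log((n + r - 1.0) / n * p))
    # The cumulative sum of the log terms gives log P(n).
    pmf = []
    running = 0.0
    for term in log_terms:
        running += term
        pmf.append(math.exp(running))
    return pmf


pmf = negbin_pmf(mean=5.0, var=12.0, nmax=400)
total = sum(pmf)
est_mean = sum(n * q for n, q in enumerate(pmf))
est_var = sum(n * n * q for n, q in enumerate(pmf)) - est_mean ** 2
```

Summing the probabilities and recomputing the first two moments is a quick way to confirm that p = 1 - m/v and r = m²/(v·p) reproduce the requested mean and variance.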

A class used to build an object associated to methods in order to learn the experimental noise from same day biological RepSeq samples. ...

Methods

get_sparserep(df) : get sparse representation of the abundances / frequencies of the TCR clones present in both RepSeq samples of interest. This changes the data input to speed up the algorithm.

learn_null_model(df, noise_model, init_paras, output_dir = None, filename = None, display_loss_function = False) : function to optimize the likelihood associated to the experimental noise model and get the associated parameters.

diversity_estimate(df, paras, noise_model) : function to get the estimation of diversity from the noise model information.
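
As a rough picture of what the diversity estimate computes, the sketch below extrapolates the number of observed clones by the probability P(0,0) of a clone being missed in both samples, N_obs / (1 - P(0,0)). It assumes the simplest case (Poisson sampling in both samples and a power-law frequency density integrated with the trapezoid rule on a log-frequency grid); `diversity_sketch` and all numerical values are hypothetical, and this is not the package's implementation.

```python
import math


def diversity_sketch(n_obs, alpha, log10_fmin, nreads1, nreads2, nfbins=1200):
    """Return (P(0,0), extrapolated clone number) under Poisson sampling.

    Hypothetical helper written with the standard library only.
    """
    # Logarithmically spaced frequencies between fmin and 1 (natural log scale).
    lo = log10_fmin * math.log(10.0)
    logf = [lo * (1.0 - i / (nfbins - 1)) for i in range(nfbins)]
    half_steps = [(logf[i + 1] - logf[i]) / 2.0 for i in range(nfbins - 1)]

    def trapz(values):
        # Trapezoid rule on the log-frequency grid.
        return sum(h * (values[i] + values[i + 1]) for i, h in enumerate(half_steps))

    # Power law rho(f) ~ f**alpha; the extra log f term is the Jacobian df = f dlogf.
    norm = trapz([math.exp(alpha * lf + lf) for lf in logf])
    # A clone of frequency f is missed in both samples with probability
    # exp(-nreads1*f) * exp(-nreads2*f) under Poisson sampling.
    integrand = [
        math.exp(alpha * lf + lf - (nreads1 + nreads2) * math.exp(lf)) / norm
        for lf in logf
    ]
    p00 = trapz(integrand)
    return p00, int(n_obs / (1.0 - p00))


# Made-up inputs, for illustration only
p00, estimate = diversity_sketch(
    n_obs=1000, alpha=-2.1, log10_fmin=-6.0, nreads1=1e5, nreads2=1e5
)
```

The estimate can never fall below the number of observed clones, since P(0,0) lies between 0 and 1.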

Noise_Model()
def get_sparserep(self, df):
    def get_sparserep(self, df):
        """
        Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
        unicountvals_1(2) are the unique values of n1(2).
        sparse_rep_counts gives the counts of unique pairs.
        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
        len(indn1)=len(indn2)=len(sparse_rep_counts)
        Parameters
        ----------
        df : pandas data frame
            data-frame which is the output of the method .import_data() for one Data_Process instance.
            this data-frame should give the list of TCR clones present in two replicate RepSeq samples
            associated to their clone frequencies and clone abundances in the first and second replicate.
        Returns
        -------
        indn1
            numpy array, for each unique pair, index of n1 in unicountvals_1
        indn2
            numpy array, for each unique pair, index of n2 in unicountvals_2
        sparse_rep_counts
            numpy array, # of clones having the read counts pair {(n1,n2)}
        unicountvals_1
            numpy array of unique count values present in the first sample in df['Clone_count_1']
        unicountvals_2
            numpy array of unique count values present in the second sample in df['Clone_count_2']
        Nreads1
            total number of counts/reads in the first sample referred in df by "_1"
        Nreads2
            total number of counts/reads in the second sample referred in df by "_2"
        """

        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']].copy()
        counts['paircount'] = 1  # gives a weight of 1 to each observed clone

        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
        clonecountpair_vals = clone_counts.index.values
        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
        NreadsI = np.sum(counts['Clone_count_1'])
        NreadsII = np.sum(counts['Clone_count_2'])

        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)

        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII

Transforms {(n1,n2)} data stored in a pandas data-frame to a sparse 1D representation. unicountvals_1(2) are the unique values of n1(2). sparse_rep_counts gives the counts of unique pairs. indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair. len(indn1)=len(indn2)=len(sparse_rep_counts)

Parameters
  • df (pandas data frame): data-frame which is the output of the method .import_data() for one Data_Process instance. This data-frame should give the list of TCR clones present in two replicate RepSeq samples, associated to their clone frequencies and clone abundances in the first and second replicate.
Returns
  • indn1: numpy array, for each unique pair, index of n1 in unicountvals_1
  • indn2: numpy array, for each unique pair, index of n2 in unicountvals_2
  • sparse_rep_counts: numpy array, # of clones having the read counts pair {(n1,n2)}
  • unicountvals_1: numpy array of unique count values present in the first sample in df['Clone_count_1']
  • unicountvals_2: numpy array of unique count values present in the second sample in df['Clone_count_2']
  • Nreads1: total number of counts/reads in the first sample referred in df by "_1"
  • Nreads2: total number of counts/reads in the second sample referred in df by "_2"
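
A toy version of this sparse representation, written with the standard library only, may help to read the returned values. It is a sketch under the assumption that the input is a plain list of (n1, n2) read-count pairs, one per clone; `sparse_rep` and the toy data are hypothetical and stand in for the pandas/numpy code of the method.

```python
from collections import Counter


def sparse_rep(pairs):
    """Sparse 1D representation of a list of (n1, n2) read-count pairs."""
    pair_counts = Counter(pairs)
    unique_pairs = sorted(pair_counts)  # same ordering as a sorted group-by
    sparse_rep_counts = [pair_counts[p] for p in unique_pairs]
    unicountvals_1 = sorted({n1 for n1, _ in unique_pairs})
    unicountvals_2 = sorted({n2 for _, n2 in unique_pairs})
    # For each unique pair, position of its n1 (n2) in the unique-value lists.
    indn1 = [unicountvals_1.index(n1) for n1, _ in unique_pairs]
    indn2 = [unicountvals_2.index(n2) for _, n2 in unique_pairs]
    nreads1 = sum(n1 for n1, _ in pairs)
    nreads2 = sum(n2 for _, n2 in pairs)
    return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, nreads1, nreads2


# Five clones, three distinct count pairs (made-up data)
pairs = [(1, 0), (1, 0), (0, 2), (3, 2), (1, 0)]
indn1, indn2, counts, uni1, uni2, nreads1, nreads2 = sparse_rep(pairs)
```

Because many clones share the same small count pair, the likelihood only has to be evaluated once per unique pair and then weighted by `sparse_rep_counts`.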
def learn_null_model( self, df, noise_model, init_paras, output_dir=None, filename=None, display_loss_function=False):
1542    def learn_null_model(self, df, noise_model, init_paras,  output_dir = None, filename = None, display_loss_function = False):  # constraint type 1 gives only low error modes, see paper for details.
1543        """
1544        Parameters
1545        ----------
1546        df : pandas data frame
1547            data-frame which is the output of the method .import_data() for one Data_Process instance.
1548            these data-frame should give the list of TCR clones present in two replicates RepSeq samples
1549            associated to their clone frequencies and clone abundances in the first and second replicate.
1550        noise_model: int
1551            choice of noise model (0, 1 or 2)
1552        init_paras: numpy array
1553            initial vector of parameters to start the optimization of the model from data (df)
1554        output_dir : str
1555            default value is None, it is the output directory name in which we want to save the values of the parameters
1556        display_loss_function : bool
1557            boolean variable to choose if we want to print the loss function during the experimental noise learning, default value is 
1558            False.
1559        
1560        Returns
1561        -------
1562        outstruct
1563            result object returned by scipy.optimize.minimize; the fitted parameters of the noise model are in outstruct.x
1564        constr_value
1565            float, value of the constraint 
1566    
1567        """
1568            
1569        # Data introduction
1570        sparse_rep = self.get_sparserep(df)
1571        constr_type = 1
1572
1573        # Choice of the model:
1574        # Parameters initialization depending on the model 
1575        if noise_model < 1:
1576            parameter_labels = ['alph_rho', 'beta', 'alpha', 'm_total', 'fmin']
1577        elif noise_model == 1:
1578            parameter_labels = ['alph_rho', 'beta', 'alpha', 'fmin']
1579        else:
1580            parameter_labels = ['alph_rho', 'fmin']
1581
1582        assert len(parameter_labels) == len(init_paras), "number of model and initial paras differ!"
1583
1584        condict = {'type': 'eq', 'fun': self._nullmodel_constr_fn, 'args': (sparse_rep, noise_model, constr_type)}
1585
1586
1587        partialobjfunc = partial(self._get_Pn1n2, sparse_rep=sparse_rep, noise_model=noise_model)
1588        nullfunctol = 1e-6
1589        nullmaxiter = 200
1590        header = ['Iter'] + parameter_labels
1591        print(''.join(['{' + str(it) + ':9s} ' for it in range(len(init_paras) + 1)]).format(*header))
1592            
1593        global curr_iter
1594        curr_iter = 1
1595        callbackp = partial(self._callback, nparas=len(init_paras), sparse_rep = sparse_rep, noise_model= noise_model)
1596        outstruct = minimize(partialobjfunc, init_paras, method='SLSQP', callback=callbackp, constraints=condict,
1597                        options={'ftol': nullfunctol, 'disp': True, 'maxiter': nullmaxiter})
1598            
1599        constr_value = self._nullmodel_constr_fn(outstruct.x, sparse_rep, noise_model, constr_type)
1600
1601        if noise_model < 1:
1602            parameter_labels = ['alph_rho', 'beta', 'alpha', 'm_total', 'fmin']
1603            d = {'label' : parameter_labels, 'value': outstruct.x}
1604            df = pd.DataFrame(data = d)
1605        elif noise_model == 1:
1606            parameter_labels = ['alph_rho', 'beta', 'alpha', 'fmin']
1607            d = {'label' : parameter_labels, 'value': outstruct.x}
1608            df = pd.DataFrame(data = d)
1609        else:
1610            parameter_labels = ['alph_rho', 'fmin']
1611            d = {'label' : parameter_labels, 'value': outstruct.x}
1612            df = pd.DataFrame(data = d)
1613
1614
1615        if output_dir is None and filename is None:
1616            df.to_csv('nullpara' + str(noise_model)+ '.txt', sep = '\t')
1617
1618        elif filename is None:
1619            df.to_csv(output_dir + '/nullpara' + str(noise_model)+ '.txt', sep = '\t')
1620
1621        else :
1622            df.to_csv((output_dir or '.') + '/' + filename + '.txt', sep = '\t')  # fall back to the current directory when only filename is given
1623
1624        return outstruct, constr_value
Parameters
  • df (pandas data frame): the data frame returned by the method .import_data() of a Data_Process instance. It should list the TCR clones present in two replicate RepSeq samples, together with their clone frequencies and clone abundances in the first and second replicate.
  • noise_model (int): choice of noise model (0, 1 or 2)
  • init_paras (numpy array): initial vector of parameters to start the optimization of the model from data (df)
  • output_dir (str): default value is None, it is the output directory name in which we want to save the values of the parameters
  • filename (str): default value is None, it is the name (without the .txt extension) of the file in which the parameters are saved
  • display_loss_function (bool): boolean variable to choose if we want to print the loss function during the experimental noise learning, default value is False.
Returns
  • outstruct: result object returned by scipy.optimize.minimize; the fitted parameters of the noise model are in outstruct.x
  • constr_value: float, value of the constraint
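The source listing above fits the parameters with scipy.optimize.minimize using the SLSQP method, an equality constraint passed as a dictionary, and a callback bound with functools.partial. The sketch below reproduces that calling pattern on a toy problem only; toy_objective, toy_constraint and print_iteration are hypothetical stand-ins for the package's likelihood, normalization constraint and callback, and the numbers have no biological meaning.

```python
from functools import partial

import numpy as np
from scipy.optimize import minimize


def toy_objective(x, target):
    """Hypothetical stand-in for the negative log-likelihood."""
    return float(np.sum((x - target) ** 2))


def toy_constraint(x, total):
    """Hypothetical equality constraint: it must evaluate to 0 at the solution."""
    return float(np.sum(x) - total)


def print_iteration(xk, nparas):
    """Print the current parameter vector after each SLSQP iteration."""
    print(" ".join("{:9.5f}".format(v) for v in xk[:nparas]))


objective = partial(toy_objective, target=np.array([1.0, 2.0]))
condict = {"type": "eq", "fun": toy_constraint, "args": (1.0,)}
callback = partial(print_iteration, nparas=2)

result = minimize(objective, np.array([0.5, 0.5]), method="SLSQP",
                  callback=callback, constraints=condict,
                  options={"ftol": 1e-6, "disp": False, "maxiter": 200})

fitted = result.x                                # fitted parameter vector
constr_value = toy_constraint(result.x, 1.0)     # constraint residual at the optimum
```

As in learn_null_model, the object returned by minimize is a result structure: the fitted vector is read from its x attribute, and the constraint function is evaluated once more at that point to check how well it is satisfied.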
def diversity_estimate(self, df, paras, noise_model):
1626    def diversity_estimate(self, df, paras, noise_model):
1627
1628        """
1629        Estimate diversity of the individual repertoire from the experimental noise learning step. 
1630        Parameters
1631        ----------
1632        df : data-frame 
1633            The data-frame which has been used to learn the noise model
1634        paras : numpy array
1635            vector containing the noise parameters
1636        noise_model : int
1637            choice of noise model 
1638        Returns
1639        -------
1640        diversity_estimate
1641            int, diversity estimate from the noise model inference.
1642    
1643        """
1644
1645        sparse_rep = self.get_sparserep(df)
1646
1647        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2, NreadsI, NreadsII = sparse_rep
1648            
1649        nfbins = 1200
1650        freq_dtype = float
1651
1652        # Parameters
1653
1654        alpha = paras[0]
1655        fmin = np.power(10,paras[-1])
1656
1657        # 
1658        logrhofvec, logfvec, normconst = self._get_rhof(alpha,nfbins,fmin,freq_dtype)
1659
1660        # 
1661
1662        logfvec_tmp=deepcopy(logfvec)
1663
1664        logPn1_f = self._get_logPn_f(unicountvals_1, NreadsI,logfvec_tmp, noise_model, paras)
1665        logPn2_f = self._get_logPn_f(unicountvals_2, NreadsII,logfvec_tmp, noise_model, paras)
1666
1667        # for the trapezoid integral methods
1668
1669        dlogfby2=np.diff(logfvec)/2
1670
1671        # Compute P(0,0) for the normalization constraint
1672        integ = np.exp(logrhofvec + logPn2_f[:, 0] + logPn1_f[:, 0] + logfvec)
1673        Pn0n0 = np.dot(dlogfby2, integ[1:] + integ[:-1])
1674
1675        #print(np.sum(sparse_rep_counts))
1676        N_obs = np.sum(sparse_rep_counts)
1677
1678        return int(N_obs/(1-Pn0n0))

Estimate diversity of the individual repertoire from the experimental noise learning step.

Parameters
  • df (data-frame): The data-frame which has been used to learn the noise model
  • paras (numpy array): vector containing the noise parameters
  • noise_model (int): choice of noise model
Returns
  • diversity_estimate: int, diversity estimate from the noise model inference.
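The estimate in the source listing divides the number of observed clones by 1 - P(0,0), where P(0,0) is the probability that a clone is unseen in both samples, obtained by a trapezoid integral over log-frequency. The following is a minimal sketch of that computation under simplifying assumptions: it uses Poisson sampling noise only (the case the listing on this page labels noise_model 2), and the function name diversity_sketch and all numerical values are hypothetical.

```python
import numpy as np


def diversity_sketch(n_obs, alpha_rho, log10_fmin, nreads_1, nreads_2, nfbins=1200):
    """Hypothetical helper: N_obs / (1 - P(0,0)) with Poisson noise."""
    # natural-log frequency grid, log-spaced between fmin and 1
    logfvec = np.log(10.0) * np.linspace(log10_fmin, 0.0, nfbins)
    dlogfby2 = np.diff(logfvec) / 2  # trapezoid weights

    # power-law density rho(f) ~ f**alpha_rho, normalized so that int rho(f) df = 1
    logrho = alpha_rho * logfvec
    integ = np.exp(logrho + logfvec)  # rho(f) * f, since df = f dlog(f)
    logrho = logrho - np.log(np.dot(dlogfby2, integ[1:] + integ[:-1]))

    # Poisson probability of zero reads at frequency f: exp(-Nreads * f)
    log_p0_1 = -nreads_1 * np.exp(logfvec)
    log_p0_2 = -nreads_2 * np.exp(logfvec)

    # P(0,0) = int rho(f) P(0|f) P(0|f) df, by the trapezoid rule in log f
    integ = np.exp(logrho + log_p0_1 + log_p0_2 + logfvec)
    p00 = np.dot(dlogfby2, integ[1:] + integ[:-1])
    return p00, int(n_obs / (1 - p00))


p00, n_total = diversity_sketch(n_obs=10_000, alpha_rho=-2.1, log10_fmin=-9.0,
                                nreads_1=1e6, nreads_2=1e6)
```

Because most clones in a steep power-law repertoire sit near fmin and are never sampled, P(0,0) is typically close to 1 and the estimated total diversity is much larger than the number of observed clones.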
class Expansion_Model:
1683class Expansion_Model():
1684    
1685    """
1686    A class used to build an object associated to methods in order to select significant expanding or 
1687    contracting clones from RepSeq samples taken at two different time points.
1688    ...
1689    Methods
1690    -------
1691    get_sparserep(df) :
1692        get sparse representation of the abundances / frequencies of the TCR clones present in RepSeq samples of both time points.
1693        This changes the data input to speed up the algorithm
1694    expansion_table(outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):
1695        generate the table of clones that have been detected as significantly responsive to an acute stimulus.
1696    """
1697
1698
1699    def get_sparserep(self, df): 
1700        """
1701        Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
1702        unicountvals_1(2) are the unique values of n1(2).
1703        sparse_rep_counts gives the counts of unique pairs.
1704        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
1705        len(indn1)=len(indn2)=len(sparse_rep_counts)
1706        Parameters
1707        ----------
1708        df : pandas data frame
1709            data-frame which is the output of the method .import_data() for one Data_Process instance.
1710            this data-frame should give the list of TCR clones present in two RepSeq samples, taken at two 
1711            different time points, associated to their clone frequencies and clone abundances at the first and second time point.
1712        Returns
1713        -------
1714        indn1
1715            numpy array list of indexes of all values of unicountvals_1
1716        indn2
1717            numpy array list of indexes of all values of unicountvals_2
1718        sparse_rep_counts
1719            numpy array, # of clones having the read counts pair {(n1,n2)} 
1720        unicountvals_1
1721            numpy array list of unique counts values present in the first sample in df[clone_count_1]
1722        unicountvals_2
1723            numpy array list of unique counts values present in the second sample in df[clone_count_2]
1724        Nreads1
1725            float, total number of counts/reads in the first sample referred in df by "_1" for first time point
1726        Nreads2
1727            float, total number of counts/reads in the second sample referred in df by "_2" for second time point
1728        """
1729        
1730        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']]
1731        counts['paircount'] = 1  # gives a weight of 1 to each observed clone
1732
1733        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
1734        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
1735        clonecountpair_vals = clone_counts.index.values
1736        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
1737        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
1738        NreadsI = np.sum(counts['Clone_count_1'])
1739        NreadsII = np.sum(counts['Clone_count_2'])
1740
1741        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
1742        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)
1743
1744        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII
1745
1746    
1747
1748    def _NegBinPar(self,m,v,mvec): 
1749        '''
1750        Same as NegBinParMtr, but for m and v being scalars.
1751        Assumes m>0.
1752        Output is (len(mvec),) array
1753        '''
1754        mmax=mvec[-1]
1755        p = 1-m/v
1756        r = m*m/v/p
1757        NBvec=np.arange(mmax+1,dtype=float)   
1758        NBvec[1:]=np.log((NBvec[1:]+r-1)/NBvec[1:]*p) #vectorization won't help unfortunately here since log needs to be over array
1759        NBvec[0]=r*math.log(m/v)
1760        NBvec=np.exp(np.cumsum(NBvec)[mvec]) #save a bit here
1761        return NBvec
1762
1763
1764    def _NegBinParMtr(self,m,v,nvec): #speed up only insofar as the log and exp are called once on array instead of multiple times on rows
1765        ''' 
1766        computes NegBin probabilities over the ordered (but possibly discontiguous) vector (nvec) 
1767        for mean/variance combinations given by the mean (m) and variance (v) vectors. 
1768        Note that m<v for negative binomial.
1769        Output is (len(m),len(nvec)) array
1770        '''
1771        nmax=nvec[-1]
1772        p = 1-m/v
1773        r = m*m/v/p
1774        NBvec=np.arange(nmax+1,dtype=float)
1775        NBvec=np.log((NBvec+r[:,np.newaxis]-1)*(p[:,np.newaxis]/NBvec))
1776        NBvec[:,0]=r*np.log(m/v) #handle NBvec[0]=0, treated specially when m[0]=0, see below
1777        NBvec=np.exp(np.cumsum(NBvec,axis=1)) #save a bit here
1778        if m[0]==0:
1779            NBvec[0,:]=0.
1780            NBvec[0,0]=1.
1781        NBvec=NBvec[:,nvec]
1782        return NBvec
1783
1784    def _PoisPar(self, Mvec,unicountvals):
1785        #assert Mvec[0]==0, "first element needs to be zero"
1786        nmax=unicountvals[-1]
1787        nlen=len(unicountvals)
1788        mlen=len(Mvec)
1789        Nvec=unicountvals
1790        logNvec=-np.insert(np.cumsum(np.log(np.arange(1,nmax+1))),0,0.)[unicountvals] #avoid n=0 nans  
1791        Nmtr=np.exp(Nvec[np.newaxis,:]*np.log(Mvec)[:,np.newaxis]+logNvec[np.newaxis,:]-Mvec[:,np.newaxis]) # np.log(Mvec) throws warning: since log(0)=-inf
1792        if Mvec[0]==0:
1793            Nmtr[0,:]=np.zeros((nlen,)) #when m=0, n=0, and so get rid of nans from log(0)
1794            Nmtr[0,0]=1. #when m=0, n=0 with probability 1
1795        if unicountvals[0]==0: #if n=0 included get rid of nans from log(0)
1796            Nmtr[:,0]=np.exp(-Mvec)
1797        return Nmtr
1798
1799    def _get_rhof(self,alpha_rho, nfbins,fmin,freq_dtype):
1800        '''
1801        generates power law (power is alpha_rho) clone frequency distribution over 
1802        nfbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
1803        Outputs log probabilities obtained at log frequencies'''
1804        fmax=1e0
1805        logfvec=np.linspace(np.log10(fmin),np.log10(fmax), nfbins)
1806        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()  
1807        logrhovec=logfvec*alpha_rho
1808        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
1809        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
1810        logrhovec-=normconst 
1811        return logrhovec,logfvec
1812
1813    
1814    def _get_logPn_f(self,unicounts,Nreads,logfvec, noise_model, paras):
1815
1816        """
1817        helper that computes log P(n|f) for the chosen noise model. It is not intended to be called directly by the user.
1818        """
1819        
1820        # Choice of the model:
1821        
1822        if noise_model<1:
1823
1824            m_total=float(np.power(10, paras[3])) 
1825            r_c=Nreads/m_total
1826        if noise_model<2:
1827
1828            beta_mv= paras[1]
1829            alpha_mv=paras[2]
1830            
1831        if noise_model<1: #for models that include cell counts
1832            #compute parametrized range (mean-sigma,mean+5*sigma) of m values (number of cells) conditioned on n values (reads) appearing in the data only 
1833            nsigma=5.
1834            nmin=300.
1835            #for each n, get actual range of m to compute around n-dependent mean m
1836            m_low =np.zeros((len(unicounts),),dtype=int)
1837            m_high=np.zeros((len(unicounts),),dtype=int)
1838            for nit,n in enumerate(unicounts):
1839                mean_m=n/r_c
1840                dev=nsigma*np.sqrt(mean_m)
1841                m_low[nit] =int(mean_m-  dev) if (mean_m>dev**2) else 0                         
1842                m_high[nit]=int(mean_m+5*dev) if (      n>nmin) else int(10*nmin/r_c)
1843            m_cellmax=np.max(m_high)
1844            #across n, collect all in-range m
1845            mvec_bool=np.zeros((m_cellmax+1,),dtype=bool) #cheap bool
1846            nvec=range(len(unicounts))
1847            for nit in nvec:
1848                mvec_bool[m_low[nit]:m_high[nit]+1]=True  #mask vector
1849            mvec=np.arange(m_cellmax+1)[mvec_bool]                
1850            #transform to in-range index
1851            for nit in nvec:
1852                m_low[nit]=np.where(m_low[nit]==mvec)[0][0]
1853                m_high[nit]=np.where(m_high[nit]==mvec)[0][0]
1854
1855        Pn_f=np.zeros((len(logfvec),len(unicounts)))
1856        if noise_model==0:
1857
1858            mean_m=m_total*np.exp(logfvec)
1859            var_m=mean_m+beta_mv*np.power(mean_m,alpha_mv)
1860            Poisvec = self._PoisPar(mvec*r_c,unicounts)
1861            for f_it in range(len(logfvec)):
1862                NBvec=self._NegBinPar(mean_m[f_it],var_m[f_it],mvec)
1863                for n_it,n in enumerate(unicounts):
1864                    Pn_f[f_it,n_it]=np.dot(NBvec[m_low[n_it]:m_high[n_it]+1],Poisvec[m_low[n_it]:m_high[n_it]+1,n_it]) 
1865        
1866        elif noise_model==1:
1867
1868            mean_n=Nreads*np.exp(logfvec)
1869            var_n=mean_n+beta_mv*np.power(mean_n,alpha_mv)
1870            Pn_f = self._NegBinParMtr(mean_n,var_n,unicounts)
1871        elif noise_model==2:
1872
1873            mean_n=Nreads*np.exp(logfvec)
1874            Pn_f= self._PoisPar(mean_n,unicounts)
1875        else:
1876            raise ValueError('noise_model must be 0, 1, or 2')  # avoid silently returning log(0)
1877
1878        return np.log(Pn_f)
1879
1880    def _get_Ps(self, alp,sbar,smax,stp):
1881        '''
1882        generates symmetric exponential distribution over log fold change
1883        with effect size sbar and nonresponding fraction 1-alp at s=0.
1884        computed over discrete range of s from -smax to smax in steps of size stp
1885        '''
1886        lamb=-stp/sbar
1887        smaxt=round(smax/stp)
1888        s_zeroind=int(smaxt)
1889        Z=2*(np.exp((smaxt+1)*lamb)-1)/(np.exp(lamb)-1)-1
1890        Ps=alp*np.exp(lamb*np.fabs(np.arange(-smaxt,smaxt+1)))/Z
1891        Ps[s_zeroind]+=(1-alp)
1892        return Ps
1893
1894    def _callbackFdiffexpr(self, Xi): #case dependent
1895        '''prints iteration info. called by scipy.optimize.minimize'''
1896               
1897        print('{0: 3.6f}   {1: 3.6f}   '.format(Xi[0], Xi[1])+'\n')   
1898    
1899
1900    def _learning_dynamics_expansion_polished(self, df, paras_1, paras_2,  noise_model):
1901        """
1902        function to infer the expansion model parameters - not intended to be called directly by the user.
1903        """
1904
1905        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = self.get_sparserep(df)
1906
1907        alpha_rho = paras_1[0]
1908        fmin = np.power(10,paras_1[-1])
1909        freq_dtype = 'float64'
1910        nfbins = 1200 #Accuracy of the integration
1911
1912
1913        logrhofvec, logfvec = self._get_rhof(alpha_rho, nfbins, fmin, freq_dtype)
1914
1915        #Definition of svec
1916        smax = 25.0     #maximum absolute logfold change value
1917        s_step = 0.1
1918        s_0 = -1
1919        
1920        s_step_old= s_step
1921        logf_step= logfvec[1] - logfvec[0] #use natural log here since f2 increments in increments in exp().  
1922        f2s_step= int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
1923        s_step= float(f2s_step)*logf_step
1924        smax= s_step*(smax/s_step_old)
1925        svec= s_step*np.arange(0,int(round(smax/s_step)+1))   
1926        svec= np.append(-svec[1:][::-1],svec)
1927
1928        smaxind=(len(svec)-1)/2
1929        f2s_step=int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
1930        logfmin=logfvec[0 ]-f2s_step*smaxind*logf_step
1931        logfmax=logfvec[-1]+f2s_step*smaxind*logf_step
1932        
1933        logfvecwide = np.linspace(logfmin,logfmax,int(len(logfvec)+2*smaxind*f2s_step)) #a wider domain for the second frequency f2=f1*exp(s)
1934            
1935        # Compute P(n1|f) and P(n2|f), each in an iteration of the following loop
1936
1937        for it in range(2):
1938            if it == 0:
1939                unicounts=unicountvals_1
1940                logfvec_tmp=deepcopy(logfvec)
1941                Nreads = NreadsI
1942                paras = paras_1
1943            else:
1944                unicounts=unicountvals_2
1945                logfvec_tmp=deepcopy(logfvecwide) #contains s-shift for sampled data method
1946                Nreads = NreadsII
1947                paras = paras_2
1948            if it == 0:
1949                logPn1_f = self._get_logPn_f( unicounts, Nreads, logfvec_tmp, noise_model, paras)
1950
1951            else:
1952                logPn2_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)
1953
1954        #for the trapezoid method
1955        dlogfby2=np.diff(logfvec)/2 
1956
1957        # Computing P(n1,n2|f,s)
1958        Pn1n2_s=np.zeros((len(svec), len(unicountvals_1), len(unicountvals_2))) 
1959
1960        for s_it,s in enumerate(svec):
1961            for it,(n1_it, n2_it) in enumerate(zip(indn1,indn2)):
1962                integ = np.exp(logrhofvec+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),n2_it]+logPn1_f[:,n1_it]+ logfvec )
1963                Pn1n2_s[s_it, n1_it, n2_it] = np.dot(dlogfby2,integ[1:] + integ[:-1])
1964            
1965    
1966        Pn0n0_s = np.zeros(svec.shape)
1967        for s_it,s in enumerate(svec):    
1968            integ=np.exp(logPn1_f[:,0]+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),0]+logrhofvec+logfvec)
1969            Pn0n0_s[s_it]=np.dot(dlogfby2,integ[1:]+integ[:-1])
1970            
1971    
1972        N_obs = np.sum(sparse_rep_counts)
1973        print("N_obs: " + str(N_obs))
1974    
1975            
1976        def cost(PARAS):
1977
1978            alp = PARAS[0]
1979            sbar = PARAS[1]
1980
1981            Ps = self._get_Ps(alp,sbar,smax,s_step)
1982            Pn0n0=np.dot(Pn0n0_s,Ps)
1983            Pn1n2_ps=np.sum(Pn1n2_s*Ps[:,np.newaxis,np.newaxis],0)
1984            Pn1n2_ps/=1-Pn0n0
1985            #print(Pn0n0)
1986
1987       
1988
1989            Energy = - np.dot(sparse_rep_counts/float(N_obs),np.where(Pn1n2_ps[indn1,indn2]>0,np.log(Pn1n2_ps[indn1,indn2]),0)) 
1990                
1991            return Energy
1992
1993    #--------------------------Compute-the-grid-----------------------------------------
1994        
1995        print('Calculation Surface : \n')
1996        st = time.time()
1997
1998        npoints = 20 #to be chosen by the user 
1999        alpvec = np.logspace(-3,np.log10(0.99), npoints)
2000        sbarvec = np.linspace(0.01,5, npoints)
2001
2002        LSurface =np.zeros((len(sbarvec),len(alpvec)))
2003        for i in range(len(sbarvec)):
2004            for j in range(len(alpvec)):
2005                LSurface[i, j]=  - cost([alpvec[j], sbarvec[i]])
2006        
2007        alpmesh, sbarmesh = np.meshgrid(alpvec, sbarvec)
2008        a,b = np.where(LSurface == np.max(LSurface))
2009        print("--- %s seconds ---" % (time.time() - st))
2010    
2011    
2012    #------------------------------Optimization----------------------------------------------
2013        
2014        optA = alpmesh[a[0],b[0]]
2015        optB = sbarmesh[a[0],b[0]]
2016                  
2017        print('polish parameter estimate from '+ str(optA)+' '+str(optB))
2018        initparas=(optA,optB)  
2019    
2020
2021        outstruct = minimize(cost, initparas, method='SLSQP', callback=self._callbackFdiffexpr, tol=1e-6,options={'ftol':1e-8 ,'disp': True,'maxiter':300})
2022
2023        return outstruct.x, Pn1n2_s, Pn0n0_s, svec
2024
2025    def _learning_dynamics_expansion(self, sparse_rep, paras_1, paras_2, noise_model, display_plot=False):
2026        """
2027        function to infer the expansion model parameters - not intended to be called directly by the user.
2028        """
2029
2030        indn1,indn2,sparse_rep_counts,unicountvals_1,unicountvals_2,NreadsI,NreadsII = sparse_rep
2031
2032        alpha_rho = paras_1[0]
2033        fmin = np.power(10,paras_1[-1])
2034        freq_dtype = 'float64'
2035        nfbins = 1200 #Accuracy of the integration
2036
2037
2038        logrhofvec, logfvec = self._get_rhof(alpha_rho, nfbins, fmin, freq_dtype)
2039
2040        #Definition of svec
2041        smax = 25.0     #maximum absolute logfold change value
2042        s_step = 0.1
2043        s_0 = -1
2044        
2045        s_step_old= s_step
2046        logf_step= logfvec[1] - logfvec[0] #use natural log here since f2 increments in increments in exp().  
2047        f2s_step= int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
2048        s_step= float(f2s_step)*logf_step
2049        smax= s_step*(smax/s_step_old)
2050        svec= s_step*np.arange(0,int(round(smax/s_step)+1))   
2051        svec= np.append(-svec[1:][::-1],svec)
2052
2053        smaxind=(len(svec)-1)/2
2054        f2s_step=int(round(s_step/logf_step)) #rounded number of f-steps in one s-step
2055        logfmin=logfvec[0 ]-f2s_step*smaxind*logf_step
2056        logfmax=logfvec[-1]+f2s_step*smaxind*logf_step
2057        
2058        logfvecwide = np.linspace(logfmin,logfmax,int(len(logfvec)+2*smaxind*f2s_step)) #a wider domain for the second frequency f2=f1*exp(s)
2059            
2060        # Compute P(n1|f) and P(n2|f), each in an iteration of the following loop
2061
2062        for it in range(2):
2063            if it == 0:
2064                unicounts=unicountvals_1
2065                logfvec_tmp=deepcopy(logfvec)
2066                Nreads = NreadsI
2067                paras = paras_1
2068            else:
2069                unicounts=unicountvals_2
2070                logfvec_tmp=deepcopy(logfvecwide) #contains s-shift for sampled data method
2071                Nreads = NreadsII
2072                paras = paras_2
2073            if it == 0:
2074                logPn1_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)
2075
2076            else:
2077                logPn2_f = self._get_logPn_f(unicounts, Nreads, logfvec_tmp, noise_model, paras)
2078
2079        #for the trapezoid method
2080        dlogfby2=np.diff(logfvec)/2 
2081
2082        # Computing P(n1,n2|f,s)
2083        Pn1n2_s=np.zeros((len(svec), len(unicountvals_1), len(unicountvals_2))) 
2084
2085        for s_it,s in enumerate(svec):
2086            for it,(n1_it, n2_it) in enumerate(zip(indn1,indn2)):
2087                integ = np.exp(logrhofvec+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),n2_it]+logPn1_f[:,n1_it]+ logfvec )
2088                Pn1n2_s[s_it, n1_it, n2_it] = np.dot(dlogfby2,integ[1:] + integ[:-1])
2089            
2090    
2091        Pn0n0_s = np.zeros(svec.shape)
2092        for s_it,s in enumerate(svec):    
2093            integ=np.exp(logPn1_f[:,0]+logPn2_f[f2s_step*s_it:(f2s_step*s_it+len(logfvec)),0]+logrhofvec+logfvec)
2094            Pn0n0_s[s_it]=np.dot(dlogfby2,integ[1:]+integ[:-1])
2095            
2096   
2097        N_obs = np.sum(sparse_rep_counts)
2098        print("N_obs: " + str(N_obs))
2099    
2100            
2101        def cost(PARAS):
2102
2103            alp = PARAS[0]
2104            sbar = PARAS[1]
2105
2106            Ps = self._get_Ps(alp,sbar,smax,s_step)
2107            Pn0n0=np.dot(Pn0n0_s,Ps)
2108            Pn1n2_ps=np.sum(Pn1n2_s*Ps[:,np.newaxis,np.newaxis],0)
2109            Pn1n2_ps/=1-Pn0n0
2110            #print(Pn0n0)
2111
2112       
2113
2114            Energy = - np.dot(sparse_rep_counts/float(N_obs),np.where(Pn1n2_ps[indn1,indn2]>0,np.log(Pn1n2_ps[indn1,indn2]),0)) 
2115                
2116            return Energy
2117
2118    #--------------------------Compute-the-grid-----------------------------------------
2119        
2120        print('Calculation Surface : \n')
2121        st = time.time()
2122
2123        npoints = 50 #to be chosen by the user 
2124        alpvec = np.logspace(-3,np.log10(0.99), npoints)
2125        sbarvec = np.linspace(0.01,5, npoints)
2126
2127        LSurface =np.zeros((len(sbarvec),len(alpvec)))
2128        for i in range(len(sbarvec)):
2129            for j in range(len(alpvec)):
2130                LSurface[i, j]=  - cost([alpvec[j], sbarvec[i]])
2131        
2132        alpmesh, sbarmesh = np.meshgrid(alpvec, sbarvec)
2133        a,b = np.where(LSurface == np.max(LSurface))
2134        print("--- %s seconds ---" % (time.time() - st))
2135    
2136    #---------------------------Plot-the-grid-------------------------------------------
2137        if display_plot:
2138
2139            fig, ax =plt.subplots(1, figsize=(10,8))
2140
2141         
2142            a,b = np.where(LSurface == np.max(LSurface))
2143
2144            ax.contour(alpmesh, sbarmesh, LSurface, linewidths=1, colors='k', linestyles = 'solid')
2145            plt.contourf(alpmesh, sbarmesh, LSurface, 20, cmap = 'viridis', alpha= 0.8)
2146
2147            xmax = alpmesh[a[0],b[0]]
2148            ymax = sbarmesh[a[0],b[0]]
2149            text= r"$ \alpha={:.3f}, s={:.3f} $".format(xmax, ymax)
2150            bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
2151            arrowprops=dict(arrowstyle="->",connectionstyle="angle,angleA=0,angleB=80")
2152            kw = dict(xycoords='data',textcoords="axes fraction",
2153                arrowprops=arrowprops, bbox=bbox_props, ha="right", va="top")
2154            plt.annotate(text, xy=(xmax, ymax), xytext=(0.94,0.96), **kw)
2155            plt.xlabel(r'$ \alpha, \ size \ of \ the \ repertoire \ that \ answers \ to \ the \ vaccine $') 
2156            plt.ylabel(r'$ s_{bar}, \ characteristic \ expansion \ decrease $')
2157            plt.xscale('log')
2158            plt.yscale('log')
2159            plt.grid()
2160            plt.title(r'$Grid \ Search \ graph \ for \ \alpha \ and \ s_{bar} \ parameters. $')
2161            plt.colorbar()
2162
2163        return LSurface, Pn1n2_s, Pn0n0_s, svec
2164 
2165
2166    def _save_table(self, outpath, svec, Ps,Pn1n2_s, Pn0n0_s,  subset, unicountvals_1_d, unicountvals_2_d, indn1_d, indn2_d, print_expanded, pthresh, smedthresh):
2167        '''
2168        takes learned diffexpr model, Pn1n2_s*Ps, computes posteriors over (n1,n2) pairs, and writes to file a table of data with clones as rows and columns as measures of their posteriors 
2169        print_expanded=True orders table as ascending by 1-P(s>0), else descending
2170        pthresh is the threshold in 'p-value'-like (null hypo) probability, 1-P(s>0|n1_i,n2_i), where i is the row (i.e. the clone) n.b. lower null prob implies larger probability of expansion
2171        smedthresh is the threshold on the posterior median, below which clones are discarded
2172        not intended to be called directly by the user. 
2173        '''
2174
2175        Psn1n2_ps=Pn1n2_s*Ps[:,np.newaxis,np.newaxis] 
2176    
2177        #compute marginal likelihood (neglect renormalization , since it cancels in conditional below) 
2178        Pn1n2_ps=np.sum(Psn1n2_ps,0)
2179
2180        Ps_n1n2ps=Pn1n2_s*Ps[:,np.newaxis,np.newaxis]/Pn1n2_ps[np.newaxis,:,:]
2181        #compute cdf to get p-value to threshold on to reduce output size
2182        cdfPs_n1n2ps=np.cumsum(Ps_n1n2ps,0)
2183    
2184
2185        def dummy(row,cdfPs_n1n2ps,unicountvals_1_d,unicountvals_2_d):
2186            '''
2187            when applied to dataframe, generates 'p-value'-like (null hypo) probability, 1-P(s>0|n1_i,n2_i), where i is the row (i.e. the clone)
2188            '''
2189            return cdfPs_n1n2ps[np.argmin(np.fabs(svec)),row['Clone_count_1']==unicountvals_1_d,row['Clone_count_2']==unicountvals_2_d][0]
2190        dummy_part=partial(dummy,cdfPs_n1n2ps=cdfPs_n1n2ps,unicountvals_1_d=unicountvals_1_d,unicountvals_2_d=unicountvals_2_d)
2191    
2192        cdflabel=r'$1-P(s>0)$'
2193        subset[cdflabel]=subset.apply(dummy_part, axis=1)
2194        subset=subset[subset[cdflabel]<pthresh].reset_index(drop=True)
2195
2196        #go from clone count pair (n1,n2) to index in unicountvals_1_d and unicountvals_2_d
2197        data_pairs_ind_1=np.zeros((len(subset),),dtype=int)
2198        data_pairs_ind_2=np.zeros((len(subset),),dtype=int)
2199        for it in range(len(subset)):
2200            data_pairs_ind_1[it]=np.where(int(subset.iloc[it].Clone_count_1)==unicountvals_1_d)[0][0]
2201            data_pairs_ind_2[it]=np.where(int(subset.iloc[it].Clone_count_2)==unicountvals_2_d)[0][0]   
2202        #posteriors over data clones
2203        Ps_n1n2ps_datpairs=Ps_n1n2ps[:,data_pairs_ind_1,data_pairs_ind_2]
2204    
2205        #compute posterior metrics
2206        mean_est=np.zeros((len(subset),))
2207        max_est= np.zeros((len(subset),))
2208        slowvec= np.zeros((len(subset),))
2209        smedvec= np.zeros((len(subset),))
2210        shighvec=np.zeros((len(subset),))
2211        pval=0.025 #double-sided comparison statistical test
2212        pvalvec=[pval,0.5,1-pval] #bound criteria defining slow, smed, and shigh, respectively
2213        for it,column in enumerate(np.transpose(Ps_n1n2ps_datpairs)):
2214            mean_est[it]=np.sum(svec*column)
2215            max_est[it]=svec[np.argmax(column)]
2216            forwardcmf=np.cumsum(column)
2217            backwardcmf=np.cumsum(column[::-1])[::-1]
2218            inds=np.where((forwardcmf[:-1]<pvalvec[0]) & (forwardcmf[1:]>=pvalvec[0]))[0]
2219            slowvec[it]=np.mean(svec[inds+np.ones((len(inds),),dtype=int)])  #use mean in case there are two values
2220            inds=np.where((forwardcmf>=pvalvec[1]) & (backwardcmf>=pvalvec[1]))[0]
2221            smedvec[it]=np.mean(svec[inds])
2222            inds=np.where((forwardcmf[:-1]<pvalvec[2]) & (forwardcmf[1:]>=pvalvec[2]))[0]
2223            shighvec[it]=np.mean(svec[inds+np.ones((len(inds),),dtype=int)])
2224    
2225        colnames=(r'$\bar{s}$',r'$s_{max}$',r'$s_{3,high}$',r'$s_{2,med}$',r'$s_{1,low}$')
2226        for it,coldata in enumerate((mean_est,max_est,shighvec,smedvec,slowvec)):
2227            subset.insert(0,colnames[it],coldata)
2228        oldcolnames=( 'AACDR3',  'ntCDR3', 'Clone_count_1', 'Clone_count_2', 'Clone_fraction_1', 'Clone_fraction_2')
2229        newcolnames=('CDR3_AA', 'CDR3_nt',        r'$n_1$',        r'$n_2$',           r'$f_1$',           r'$f_2$')
2230        subset=subset.rename(columns=dict(zip(oldcolnames, newcolnames)))
2231    
2232        #select only clones whose posterior median pass the given threshold
2233        subset=subset[subset[r'$s_{2,med}$']>smedthresh]
2234    
2235        print("writing to: "+outpath)
2236        if print_expanded:
2237            subset=subset.sort_values(by=cdflabel,ascending=True)
2238            strout='expanded'
2239        else:
2240            subset=subset.sort_values(by=cdflabel,ascending=False)
2241            strout='contracted'
2242        subset.to_csv(outpath+'top_'+strout+'.csv',sep='\t',index=False)
2243
2244
2245
2246    def expansion_table(self, outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):
2247
2248        '''
2249        Generate the table of clones detected as significantly responsive to an acute stimulus.
2250    
2251        Parameters
2252        ----------
2253        outpath  : str
2254            Name of the directory where to store the output table
2255        paras_1  : numpy array
2256            parameters of the noise model that has been learned at time_1
2257        paras_2  : numpy array
2258            parameters of the noise model that has been learned at time_2
2259        df       : pandas dataframe 
2260            pandas dataframe merging the two RepSeq data at time_1 and time_2
2261        noise_model : int
2262            choice of noise model 0: Poisson, 1: negative Binomial, 2: negative Binomial + Poisson  
2263        pval_threshold : float
2264            P-value threshold to detect and discriminate if a TCR clone has expanded 
2265        smed_threshold : float
2266            median of the log-fold change threshold to detect if a TCR clone has expanded 
2267        Returns
2268        -------
2269        data-frame - csv file
2270            the output is a csv file of columns : $s_{1-low}$, $s_{2-med}$, $s_{3-high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and '$p$-value'
2271        '''
2272
2273        sparse_rep = self.get_sparserep(df)
2274        L_surface, Pn1n2_s_d, Pn0n0_s_d, svec = self._learning_dynamics_expansion(sparse_rep, paras_1, paras_2, noise_model)
2275        npoints= 50 # same as in learning_dynamics_expansion
2276        smax = 25.0     
2277        s_step = 0.1
2278        alpvec = np.logspace(-3,np.log10(0.99), npoints)
2279        sbarvec = np.linspace(0.01,5, npoints)
2280        maxinds=np.unravel_index(np.argmax(L_surface),np.shape(L_surface))
2281        optsbar=sbarvec[maxinds[0]]
2282        optalp=alpvec[maxinds[1]]
2283        optPs= self._get_Ps(optalp,optsbar,smax,s_step)
2284        pval_expanded = True
2285
2286        indn1,indn2,sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep
2287
2288        self._save_table(outpath, svec, optPs, Pn1n2_s_d, Pn0n0_s_d,  df, unicountvals_1, unicountvals_2, indn1, indn2, pval_expanded, pval_threshold, smed_threshold)

A class whose methods select significantly expanding or contracting clones from RepSeq samples taken at two different time points. ...

Methods

get_sparserep(df): get a sparse representation of the abundances / frequencies of the TCR clones present in the RepSeq samples of both time points. This changes the data input to speed up the algorithm.
expansion_table(outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold): generate the table of clones detected as significantly responsive to an acute stimulus.

Expansion_Model()
def get_sparserep(self, df):
1699    def get_sparserep(self, df): 
1700        """
1701        Transforms {(n1,n2)} data stored in pandas dataframe to a sparse 1D representation.
1702        unicountvals_1(2) are the unique values of n1(2).
1703        sparse_rep_counts gives the counts of unique pairs.
1704        indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair.
1705        len(indn1)=len(indn2)=len(sparse_rep_counts)
1706        Parameters
1707        ----------
1708        df : pandas data frame
1709            data-frame which is the output of the method .import_data() for one Data_Process instance.
1710            this data-frame should list the TCR clones present in two RepSeq samples, taken at two 
1711            different time points, with their clone frequencies and clone abundances in the first and second sample.
1712        Returns
1713        -------
1714        indn1
1715            numpy array list of indexes of all values of unicountvals_1
1716        indn2
1717            numpy array list of indexes of all values of unicountvals_2
1718        sparse_rep_counts
1719            numpy array, # of clones having the read counts pair {(n1,n2)} 
1720        unicountvals_1
1721            numpy array list of unique counts values present in the first sample in df[clone_count_1]
1722        unicountvals_2
1723            numpy array list of unique counts values present in the second sample in df[clone_count_2]
1724        Nreads1
1725            float, total number of counts/reads in the first sample referred in df by "_1" for first time point
1726        Nreads2
1727            float, total number of counts/reads in the second sample referred in df by "_2" for second time point
1728        """
1729        
1730        counts = df.loc[:,['Clone_count_1', 'Clone_count_2']]
1731        counts['paircount'] = 1  # gives a weight of 1 to each observed clone
1732
1733        clone_counts = counts.groupby(['Clone_count_1', 'Clone_count_2']).sum()
1734        sparse_rep_counts = np.asarray(clone_counts.values.flatten(), dtype=int)
1735        clonecountpair_vals = clone_counts.index.values
1736        indn1 = np.asarray([clonecountpair_vals[it][0] for it in range(len(sparse_rep_counts))], dtype=int)
1737        indn2 = np.asarray([clonecountpair_vals[it][1] for it in range(len(sparse_rep_counts))], dtype=int)
1738        NreadsI = np.sum(counts['Clone_count_1'])
1739        NreadsII = np.sum(counts['Clone_count_2'])
1740
1741        unicountvals_1, indn1 = np.unique(indn1, return_inverse=True)
1742        unicountvals_2, indn2 = np.unique(indn2, return_inverse=True)
1743
1744        return indn1, indn2, sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII

Transforms {(n1,n2)} data stored in a pandas dataframe to a sparse 1D representation. unicountvals_1(2) are the unique values of n1(2). sparse_rep_counts gives the counts of unique pairs. indn1(2) is the index of unicountvals_1(2) giving the value of n1(2) in that unique pair. len(indn1)=len(indn2)=len(sparse_rep_counts)

Parameters
  • df (pandas data frame): data-frame which is the output of the method .import_data() for one Data_Process instance. This data-frame should list the TCR clones present in two RepSeq samples, taken at two different time points, with their clone frequencies and clone abundances in the first and second sample.
Returns
  • indn1: numpy array; for each unique (n1,n2) pair, the index into unicountvals_1 of its n1 value
  • indn2: numpy array; for each unique (n1,n2) pair, the index into unicountvals_2 of its n2 value
  • sparse_rep_counts: numpy array, # of clones having the read counts pair {(n1,n2)}
  • unicountvals_1: numpy array list of unique counts values present in the first sample in df[clone_count_1]
  • unicountvals_2: numpy array list of unique counts values present in the second sample in df[clone_count_2]
  • Nreads1: float, total number of counts/reads in the first sample referred in df by "_1" for first time point
  • Nreads2: float, total number of counts/reads in the second sample referred in df by "_2" for second time point
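
The sparse representation described above can be illustrated with a small, self-contained sketch. It is not the package's implementation: the helper name sparse_pair_counts and the toy counts are hypothetical, and it uses only NumPy instead of the pandas groupby shown in the source. It only demonstrates how per-clone count pairs collapse into unique pairs, their multiplicities, and indices into the unique count values.

```python
import numpy as np

def sparse_pair_counts(n1, n2):
    """Collapse per-clone count pairs (hypothetical helper, NumPy only)."""
    pairs = np.column_stack((np.asarray(n1, dtype=int), np.asarray(n2, dtype=int)))
    # Unique rows come back in lexicographic order, like a groupby on (n1, n2).
    unique_pairs, pair_counts = np.unique(pairs, axis=0, return_counts=True)
    # Map each unique pair to an index into the sorted unique values of n1 and n2.
    unicountvals_1, indn1 = np.unique(unique_pairs[:, 0], return_inverse=True)
    unicountvals_2, indn2 = np.unique(unique_pairs[:, 1], return_inverse=True)
    return indn1.ravel(), indn2.ravel(), pair_counts, unicountvals_1, unicountvals_2

# Toy data: six clones observed at two time points.
n1 = [0, 1, 1, 3, 1, 0]
n2 = [2, 0, 0, 5, 2, 2]
indn1, indn2, pair_counts, uni1, uni2 = sparse_pair_counts(n1, n2)

# Each unique pair is recovered as (uni1[indn1[k]], uni2[indn2[k]]).
for k in range(len(pair_counts)):
    print((uni1[indn1[k]], uni2[indn2[k]]), "seen", pair_counts[k], "times")
```

Working with unique pairs means any quantity that depends only on (n1, n2) is evaluated once per pair and weighted by its multiplicity, rather than once per clone.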
def expansion_table( self, outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):
2246    def expansion_table(self, outpath, paras_1, paras_2, df, noise_model, pval_threshold, smed_threshold):
2247
2248        '''
2249        Generate the table of clones detected as significantly responsive to an acute stimulus.
2250    
2251        Parameters
2252        ----------
2253        outpath  : str
2254            Name of the directory where to store the output table
2255        paras_1  : numpy array
2256            parameters of the noise model that has been learned at time_1
2257        paras_2  : numpy array
2258            parameters of the noise model that has been learned at time_2
2259        df       : pandas dataframe 
2260            pandas dataframe merging the two RepSeq data at time_1 and time_2
2261        noise_model : int
2262            choice of noise model 0: Poisson, 1: negative Binomial, 2: negative Binomial + Poisson  
2263        pval_threshold : float
2264            P-value threshold to detect and discriminate if a TCR clone has expanded 
2265        smed_threshold : float
2266            median of the log-fold change threshold to detect if a TCR clone has expanded 
2267        Returns
2268        -------
2269        data-frame - csv file
2270            the output is a csv file of columns : $s_{1-low}$, $s_{2-med}$, $s_{3-high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and '$p$-value'
2271        '''
2272
2273        sparse_rep = self.get_sparserep(df)
2274        L_surface, Pn1n2_s_d, Pn0n0_s_d, svec = self._learning_dynamics_expansion(sparse_rep, paras_1, paras_2, noise_model)
2275        npoints= 50 # same as in learning_dynamics_expansion
2276        smax = 25.0     
2277        s_step = 0.1
2278        alpvec = np.logspace(-3,np.log10(0.99), npoints)
2279        sbarvec = np.linspace(0.01,5, npoints)
2280        maxinds=np.unravel_index(np.argmax(L_surface),np.shape(L_surface))
2281        optsbar=sbarvec[maxinds[0]]
2282        optalp=alpvec[maxinds[1]]
2283        optPs= self._get_Ps(optalp,optsbar,smax,s_step)
2284        pval_expanded = True
2285
2286        indn1,indn2,sparse_rep_counts, unicountvals_1, unicountvals_2, NreadsI, NreadsII = sparse_rep
2287
2288        self._save_table(outpath, svec, optPs, Pn1n2_s_d, Pn0n0_s_d,  df, unicountvals_1, unicountvals_2, indn1, indn2, pval_expanded, pval_threshold, smed_threshold)

Generate the table of clones detected as significantly responsive to an acute stimulus.

Parameters
  • outpath (str): Name of the directory where to store the output table
  • paras_1 (numpy array): parameters of the noise model that has been learned at time_1
  • paras_2 (numpy array): parameters of the noise model that has been learned at time_2
  • df (pandas dataframe): pandas dataframe merging the two RepSeq data at time_1 and time_2
  • noise_model (int): choice of noise model 0: Poisson, 1: negative Binomial, 2: negative Binomial + Poisson
  • pval_threshold (float): P-value threshold to detect and discriminate if a TCR clone has expanded
  • smed_threshold (float): median of the log-fold change threshold to detect if a TCR clone has expanded
Returns
  • data-frame - csv file: the output is a csv file of columns : $s_{1-low}$, $s_{2-med}$, $s_{3-high}$, $s_{max}$, $\bar{s}$, $f_1$, $f_2$, $n_1$, $n_2$, 'CDR3_nt', 'CDR3_AA' and '$p$-value'
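
The summary columns listed above ($s_{1-low}$, $s_{2-med}$, $s_{3-high}$, $s_{max}$, $\bar{s}$) are statistics of a posterior over the log-fold change s on a discrete grid. The sketch below shows one way such statistics can be computed. It is a simplified variant, not the package's code: the function name summarize_posterior, the toy Gaussian-shaped posterior, and the "first grid point reaching the quantile" rule are assumptions made for illustration.

```python
import numpy as np

def summarize_posterior(svec, post, tail=0.025):
    """Mean, mode and quantiles of a discretized posterior (hypothetical helper)."""
    post = np.asarray(post, dtype=float)
    post = post / post.sum()          # normalize the probability mass
    cdf = np.cumsum(post)

    def quantile(q):
        # First grid point at which the cumulative mass reaches q.
        idx = min(int(np.searchsorted(cdf, q)), len(svec) - 1)
        return float(svec[idx])

    return {
        "s_mean": float(np.sum(svec * post)),
        "s_max": float(svec[np.argmax(post)]),
        "s_low": quantile(tail),        # lower bound of the 95% interval
        "s_med": quantile(0.5),         # posterior median
        "s_high": quantile(1.0 - tail)  # upper bound of the 95% interval
    }

# Toy posterior: Gaussian-shaped, centred on s = 1 with width 0.5.
svec = np.linspace(-5.0, 5.0, 101)
post = np.exp(-0.5 * ((svec - 1.0) / 0.5) ** 2)
summary = summarize_posterior(svec, post)
print(summary)
```

A clone would then be reported only if its median exceeds the chosen smed_threshold, which is the role of the $s_{2-med}$ column in the filtering step.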
class Generator:
2293class Generator:
2294
2295    """
2296    A class used to generate in-silico (synthetic) RepSeq samples, either as replicates taken on
2297    the same day or as two samples, the first generated at an initial time and the second some time later (months, years),
2298    using the geometric Brownian motion model described in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1.
2299    ...
2300    Methods
2301    -------
2302    gen_synthetic_data_Null(paras, noise_model, NreadsI,NreadsII,Nsamp):
2303        generate in-silico same day RepSeq replicates.
2304    generate_trajectories(tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):
2305        generate in-silico t_ime apart RepSeq samples.
2306    """
2307
2308    def _get_rhof(self, alpha_rho, fmin, freq_nbins=800, freq_dtype='float64'):
2309
2310        '''
2311        generates power law (power is alpha_rho) clone frequency distribution over 
2312        freq_nbins discrete logarithmically spaced frequencies between fmin and 1 of dtype freq_dtype
2313        Outputs log probabilities obtained at log frequencies'''
2314        fmax=1e0
2315        logfvec=np.linspace(np.log10(fmin),np.log10(fmax),freq_nbins)
2316        logfvec=np.array(np.log(np.power(10,logfvec)) ,dtype=freq_dtype).flatten()  
2317        logrhovec=logfvec*alpha_rho
2318        integ=np.exp(logrhovec+logfvec,dtype=freq_dtype)
2319        normconst=np.log(np.dot(np.diff(logfvec)/2.,integ[1:]+integ[:-1]))
2320        logrhovec-=normconst 
2321        return logrhovec,logfvec
2322
2323    def _get_distsample(self, pmf,Nsamp,dtype='uint32'):
2324        '''
2325        generates Nsamp index samples of dtype (e.g. uint16 handles up to 65535 indices) from discrete probability mass function pmf.
2326        Handles multi-dimensional domain. N.B. Output is sorted.
2327        '''
2328        #assert np.sum(pmf)==1, "cmf not normalized!"
2329    
2330        shape = np.shape(pmf)
2331        sortindex = np.argsort(pmf, axis=None)#uses flattened array
2332        pmf = pmf.flatten()
2333        pmf = pmf[sortindex]
2334        cmf = np.cumsum(pmf)
2335        choice = np.random.uniform(high = cmf[-1], size = int(float(Nsamp)))
2336        index = np.searchsorted(cmf, choice)
2337        index = sortindex[index]
2338        index = np.unravel_index(index, shape)
2339        index = np.transpose(np.vstack(index))
2340        sampled_inds = np.array(index[np.argsort(index[:,0])],dtype=dtype)
2341        return sampled_inds
2342
2343    
2344    def gen_synthetic_data_Null(self, paras, noise_model, NreadsI,NreadsII,Nsamp):
2345        '''
2346        outputs an array of observed clone frequencies and corresponding dataframe of pair counts
2347        for a null model learned from a dataset pair with NreadsI and NreadsII number of reads, respectively.
2348        Crucial for RAM efficiency, sampling is conditioned on being observed in each of the three (n,0), (0,n'), and n,n'>0 conditions
2349        so that only Nsamp clones need to be sampled, rather than the N clones in the repertoire.
2350        Note that no explicit normalization is applied. It is assumed that the values in paras are consistent with N<f>=1 
2351        (e.g. were obtained through the learning done in this package).
2352        '''
2353
2354    
2355        alpha = paras[0] #power law exponent
2356        fmin=np.power(10,paras[-1])
2357        if noise_model<1:
2358            m_total=float(np.power(10, paras[3])) 
2359            r_c1=NreadsI/m_total
2360            r_c2=NreadsII/m_total
2361            r_cvec=[r_c1,r_c2]
2362        if noise_model<2:
2363            beta_mv= paras[1]
2364            alpha_mv=paras[2]
2365    
2366        logrhofvec,logfvec = self._get_rhof(alpha,fmin)
2367        fvec=np.exp(logfvec)
2368        dlogf=np.diff(logfvec)/2.
2369    
2370        #generate measurement model distribution, Pn_f
2371        Pn_f=np.empty((len(logfvec),),dtype=object) #len(logfvec) samplers
2372    
2373        #get value at n=0 to use for conditioning on n>0 (and get full Pn_f here if noise_model=1,2)
2374        m_max=1e3 #conditioned on n=0, so no edge effects
2375    
2376        Nreadsvec=(NreadsI,NreadsII)
2377        for it in range(2):
2378            Pn_f=np.empty((len(fvec),),dtype=object)
2379            if noise_model==2:
2380                m1vec=Nreadsvec[it]*fvec
2381                for find,m1 in enumerate(m1vec):
2382                    Pn_f[find]=poisson(m1)
2383                logPn0_f=-m1vec
2384            elif noise_model==1:
2385                m1=Nreadsvec[it]*fvec
2386                v1=m1+beta_mv*np.power(m1,alpha_mv)
2387                p=1-m1/v1
2388                n=m1*m1/v1/p
2389                for find,(n,p) in enumerate(zip(n,p)):
2390                    Pn_f[find]=nbinom(n,1-p)
2391                Pn0_f=np.asarray([Pn_find.pmf(0) for Pn_find in Pn_f])
2392                logPn0_f=np.log(Pn0_f)
2393            
2394            elif noise_model==0:
2395                m1=m_total*fvec
2396                v1=m1+beta_mv*np.power(m1,alpha_mv)
2397                p=1-m1/v1
2398                n=m1*m1/v1/p
2399                Pn0_f=np.zeros((len(fvec),))
2400                for find in range(len(Pn0_f)):
2401                    nbtmp=nbinom(n[find],1-p[find]).pmf(np.arange(m_max+1))
2402                    ptmp=poisson(r_cvec[it]*np.arange(m_max+1)).pmf(0)
2403                    Pn0_f[find]=np.sum(np.exp(np.log(nbtmp)+np.log(ptmp)))
2404                logPn0_f=np.log(Pn0_f)
2405            else:
2406                print('acq_model is 0,1,or 2 only')
2407            
2408            if it==0:
2409                Pn1_f=Pn_f
2410                logPn10_f=logPn0_f
2411            else:
2412                Pn2_f=Pn_f
2413                logPn20_f=logPn0_f
2414
2415        #3-quadrant q|f conditional distribution (qx0:n1>0,n2=0;q0x:n1=0,n2>0;qxx:n1,n2>0)
2416        logPqx0_f=np.log(1-np.exp(logPn10_f))+logPn20_f
2417        logPq0x_f=logPn10_f+np.log(1-np.exp(logPn20_f))
2418        logPqxx_f=np.log(1-np.exp(logPn10_f))+np.log(1-np.exp(logPn20_f))
2419        #3-quadrant q,f joint distribution
2420        logPfqx0=logPqx0_f+logrhofvec
2421        logPfq0x=logPq0x_f+logrhofvec
2422        logPfqxx=logPqxx_f+logrhofvec
2423        #3-quadrant q marginal distribution 
2424        Pqx0=np.trapz(np.exp(logPfqx0+logfvec),x=logfvec)
2425        Pq0x=np.trapz(np.exp(logPfq0x+logfvec),x=logfvec)
2426        Pqxx=np.trapz(np.exp(logPfqxx+logfvec),x=logfvec)
2427    
2428        #3 quadrant conditional f|q distribution
2429        Pf_qx0=np.where(Pqx0>0,np.exp(logPfqx0-np.log(Pqx0)),0)
2430        Pf_q0x=np.where(Pq0x>0,np.exp(logPfq0x-np.log(Pq0x)),0)
2431        Pf_qxx=np.where(Pqxx>0,np.exp(logPfqxx-np.log(Pqxx)),0)
2432    
2433        #3-quadrant q marginal distribution
2434        newPqZ=Pqx0 + Pq0x + Pqxx
2435        Pqx0/=newPqZ
2436        Pq0x/=newPqZ
2437        Pqxx/=newPqZ
2438
2439        Pfqx0=np.exp(logPfqx0)
2440        Pfq0x=np.exp(logPfq0x)
2441        Pfqxx=np.exp(logPfqxx)
2442    
2443        print('Model probs: '+str(Pqx0)+' '+str(Pq0x)+' '+str(Pqxx))
2444
2445        #get samples 
2446        num_samples=Nsamp
2447        q_samples=np.random.choice(range(3), num_samples, p=(Pqx0,Pq0x,Pqxx))
2448        vals,counts=np.unique(q_samples,return_counts=True)
2449        num_qx0=counts[0]
2450        num_q0x=counts[1]
2451        num_qxx=counts[2]
2452        print('q samples: '+str(sum(counts))+' '+str(num_qx0)+' '+str(num_q0x)+' '+str(num_qxx))
2453        print('q sampled probs: '+str(num_qx0/float(sum(counts)))+' '+str(num_q0x/float(sum(counts)))+' '+str(num_qxx/float(sum(counts))))
2454    
2455        #x0
2456        integ=np.exp(np.log(Pf_qx0)+logfvec)
2457        f_samples_inds= self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qx0).flatten()
2458        f_sorted_inds=np.argsort(f_samples_inds)
2459        f_samples_inds=f_samples_inds[f_sorted_inds] 
2460        qx0_f_samples=fvec[f_samples_inds]
2461        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
2462        qx0_samples=np.zeros((num_qx0,))
2463        if noise_model<1:
2464            qx0_m_samples=np.zeros((num_qx0,))
2465            #conditioning on n>0 applies an m-dependent factor to Pm_f, which can't be incorporated into the ppf method used for noise_model 1 and 2. 
2466            #We handle that here by using a custom finite range sampler, which has the drawback of having to define an upper limit. 
2467            #This works so long as n_max/r_c<<m_max, so depends on highest counts in data (n_max). My data had max counts of 1e3-1e4.
2468            #Alternatively, could define a custom scipy RV class by defining it's PMF, but has to be array-compatible which requires care. 
2469            m_samp_max=int(1e5) 
2470            mvec=np.arange(m_samp_max)   
2471    
2472        for it,find in enumerate(find_vals):
2473            if noise_model==0:      
2474                m1=m_total*fvec[find]
2475                v1=m1+beta_mv*np.power(m1,alpha_mv)
2476                p=1-m1/v1
2477                n=m1*m1/v1/p
2478                Pm1_f=nbinom(n,1-p)
2479            
2480                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
2481                Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
2482                qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])
2483            
2484                mvals,minds,m_counts=np.unique(qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
2485                for mit,m in enumerate(mvals):
2486                    Pn1_m1=poisson(r_c1*m)
2487                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
2488                    qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)
2489 
2490        
2491            elif noise_model>0:
2492                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
2493                qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
2494            else:
2495                print('acq_model is 0,1, or 2 only')
2496        qx0_pair_samples=np.hstack((qx0_samples[:,np.newaxis],np.zeros((num_qx0,1)))) 
2497    
2498        #0x
2499        integ=np.exp(np.log(Pf_q0x)+logfvec)
2500        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_q0x).flatten()
2501        f_sorted_inds=np.argsort(f_samples_inds)
2502        f_samples_inds=f_samples_inds[f_sorted_inds] 
2503        q0x_f_samples=fvec[f_samples_inds]
2504        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
2505        q0x_samples=np.zeros((num_q0x,))
2506        if noise_model<1:
2507            q0x_m_samples=np.zeros((num_q0x,))
2508        for it,find in enumerate(find_vals):
2509            if noise_model==0:
2510                m2=m_total*fvec[find]
2511                v2=m2+beta_mv*np.power(m2,alpha_mv)
2512                p=1-m2/v2
2513                n=m2*m2/v2/p
2514                Pm2_f=nbinom(n,1-p)
2515            
2516                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
2517                Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
2518                q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])
2519
2520                mvals,minds,m_counts=np.unique(q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
2521                for mit,m in enumerate(mvals):
2522                    Pn2_m2=poisson(r_c2*m)
2523                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
2524                    q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)
2525        
2526                       
2527        
2528            elif noise_model > 0:
2529                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
2530                q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
2531            else:
2532                print('acq_model is 0,1,or 2 only')
2533        q0x_pair_samples=np.hstack((np.zeros((num_q0x,1)),q0x_samples[:,np.newaxis]))
2534    
2535        #qxx
2536        integ=np.exp(np.log(Pf_qxx)+logfvec)
2537        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qxx).flatten()        
2538        f_sorted_inds=np.argsort(f_samples_inds)
2539        f_samples_inds=f_samples_inds[f_sorted_inds] 
2540        qxx_f_samples=fvec[f_samples_inds]
2541        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
2542        qxx_n1_samples=np.zeros((num_qxx,))
2543        qxx_n2_samples=np.zeros((num_qxx,))
2544        if noise_model<1:
2545            qxx_m1_samples=np.zeros((num_qxx,))
2546            qxx_m2_samples=np.zeros((num_qxx,))
2547        for it,find in enumerate(find_vals):
2548            if noise_model==0:
2549                m1=m_total*fvec[find]
2550                v1=m1+beta_mv*np.power(m1,alpha_mv)
2551                p=1-m1/v1
2552                n=m1*m1/v1/p
2553                Pm1_f=nbinom(n,1-p)
2554            
2555                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
2556                if np.sum(Pm1_f_adj)==0:
2557                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
2558                else:
2559                    Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
2560                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])
2561
2562                mvals,minds,m_counts=np.unique(qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
2563                for mit,m in enumerate(mvals):
2564                    Pn1_m1=poisson(r_c1*m)
2565                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
2566                    qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)
2567                
2568                m2=m_total*fvec[find]
2569                v2=m2+beta_mv*np.power(m2,alpha_mv)
2570                p=1-m2/v2
2571                n=m2*m2/v2/p
2572                Pm2_f=nbinom(n,1-p)
2573            
2574                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
2575                if np.sum(Pm2_f_adj)==0:
2576                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
2577                else:
2578                    Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
2579                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])
2580
2581                mvals,minds,m_counts=np.unique(qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
2582                for mit,m in enumerate(mvals):
2583                    Pn2_m2=poisson(r_c2*m)
2584                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
2585                    qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)    
2586
2587                          
2588            elif noise_model>0:
2589                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
2590                qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
2591                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
2592                qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
2593            else:
2594                print('acq_model is 0,1, or 2 only')
2595            
2596        qxx_pair_samples=np.hstack((qxx_n1_samples[:,np.newaxis],qxx_n2_samples[:,np.newaxis]))
2597    
2598        pair_samples=np.vstack((q0x_pair_samples,qx0_pair_samples,qxx_pair_samples))
2599        f_samples=np.concatenate((q0x_f_samples,qx0_f_samples,qxx_f_samples))
2600        output_m_samples=False
2601        if noise_model<1 and output_m_samples:                
2602            m1_samples=np.concatenate((q0x_m1_samples,qx0_m1_samples,qxx_m1_samples))
2603            m2_samples=np.concatenate((q0x_m2_samples,qx0_m2_samples,qxx_m2_samples))
2604    
2605        pair_samples_df=pd.DataFrame({'Clone_count_1':pair_samples[:,0],'Clone_count_2':pair_samples[:,1]})
2606
2607        pair_samples_df['Clone_fraction_1'] = pair_samples_df['Clone_count_1']/np.sum(pair_samples_df['Clone_count_1'])
2608        pair_samples_df['Clone_fraction_2'] = pair_samples_df['Clone_count_2']/np.sum(pair_samples_df['Clone_count_2'])
2609    
2610        return f_samples,pair_samples_df
2611
2612
2613    def generate_trajectories(self, tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'):
2614
2615
2616        """
2617        generate in-silico t_ime apart RepSeq samples.
2618        
2619        Parameters
2620        ----------
2621        paras_1  : numpy array
2622            parameters of the noise model that has been learnt at time_1
2623        paras_2  : numpy array
2624            parameters of the noise model that has been learnt at time_2
2625        method   : str
2626            'negative_binomial' or 'poisson'
2627        tau      : float
2628            first time-scale parameter of the dynamics
2629        theta    : float
2630            second time-scale parameter of the dynamics
2631        t_ime    : float
2632            number of years between both synthetic sampling (between time_1 and time_2)
2633        filename : str
2634            name of the file in which the dataframe is stored  
2635        Returns
2636        -------
2637        data-frame - csv file
2638            the output is a csv file of columns : 'Clone_count_1' (at time_1) 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'
2639        """
2640
2641        np.seterr(divide = 'ignore') 
2642        import warnings; warnings.filterwarnings('ignore')  # np.warnings is unavailable in recent NumPy releases
2643
2644        method = 'negative_binomial'  # NOTE: overrides the `method` argument, so the 'poisson' branch below is never reached
2645
2646
2647        # Synthetic data generation
2648
2649        print('execution starting...')
2650
2651        st = time.time()
2652
2653        #Values of the parameters
2654        A = -1/tau
2655        B = 1/theta
2656        N_0 = 40
2657        NreadsI = float(NreadsI)
2658        NreadsII = float(NreadsII)
2659
2660        t = float(t_ime)
2661
2662        if NreadsI == NreadsII:
2663            key_sym = '_sym_'
2664
2665        else:
2666            key_sym = '_asym_'
2667
2668        # Name of the directory
2669
2670
2671        dirName = 'output'    
2672        os.makedirs(dirName, exist_ok=True) 
2673
2674        paras = paras_1 #Just put a and b of the negative binomiale distribution [0.7, 1.1]
2675        alpha = -1 +2*A/B
2676        #print('alpha : ' + str(alpha))
2677
2678        #1/ Generate log-population at initial time from steady-state distribution + GBM diffusion trajectories for 2 years
2679        x_i_LB, x_f_LB, Prop_Matrix_LB, p_ext_LB, results_extinction_LB, time_vec_LB, results_extinction_source_LB, x_source_LB = _generator_diffusion_LB(A, B, N_0, t)
2680        
2681        #x_i_LB, x_f_LB, Prop_Matrix, p_ext, results_extinction  = generator_diffusion_LB(B, A, N_0, t)
2682        N_cells_day_0_LB, N_cells_day_1_LB = np.sum(np.exp(x_i_LB)), np.sum(np.exp(x_f_LB)) + np.sum(np.exp(x_source_LB))  #N_cells_final_LB
2683        print('NUMBER OF CELLS AT INITIAL TIME')
2684        print(N_cells_day_0_LB)
2685
2686        print('NUMBER OF CELLS AT FINAL TIME')
2687        print(N_cells_day_1_LB)
2688
2689        #print('SHAPE_X_I ' +  str(np.shape(x_i_LB)))
2690        #print('SHAPE_X_F ' +  str(np.shape(x_f_LB)))
2691
2692
2693        if method == 'negative_binomial':
2694
2695            df_diffusion_LB  = _experimental_sampling_diffusion_NegBin(NreadsI, NreadsII, paras, x_i_LB, x_f_LB, N_cells_day_0_LB, N_cells_day_1_LB)
2696            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')
2697
2698        elif method == 'poisson': 
2699
2700            df_diffusion_LB  = _experimental_sampling_diffusion_Poisson(NreadsI, NreadsII, x_i_LB, x_f_LB, t, N_cells_day_0_LB, N_cells_day_1_LB)
2701            df_diffusion_LB.to_csv(filename + '.csv' , sep= '\t')

A class used to build an object that generates in-silico (synthetic) RepSeq samples, either as same-day replicates or as two samples, the first taken at an initial time and the second some time later (months, years), using the geometric Brownian motion model described in https://www.biorxiv.org/content/10.1101/2022.05.01.490247v1. ...
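
To make the geometric Brownian motion picture concrete, here is a minimal sketch in which each clone's log-size x performs a drift-diffusion dx = A dt + sqrt(B) dW, with A = -1/tau and B = 1/theta as assigned in the source of generate_trajectories. The function name simulate_log_sizes, the Euler-Maruyama discretisation, the starting size of 40 cells and the parameter values are assumptions made for illustration. The package's own generator also handles extinction and a source of new clones, which this sketch leaves out.

```python
import numpy as np

def simulate_log_sizes(x_init, tau, theta, t, n_steps=200, seed=0):
    """Euler-Maruyama sketch of dx = A dt + sqrt(B) dW for log clone sizes x.

    A = -1/tau and B = 1/theta mirror the assignments in generate_trajectories;
    the discretisation itself is an assumption, not the package's code.
    """
    rng = np.random.default_rng(seed)
    A, B = -1.0 / tau, 1.0 / theta
    dt = t / n_steps
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        # deterministic drift plus Gaussian increment of variance B * dt
        x += A * dt + np.sqrt(B * dt) * rng.standard_normal(x.shape)
    return x

x0 = np.full(20000, np.log(40.0))      # hypothetical: every clone starts at 40 cells
x1 = simulate_log_sizes(x0, tau=2.0, theta=1.0, t=2.0)
mean_shift = float(np.mean(x1 - x0))   # should be close to A * t = -1.0
```

With a negative drift the average log-size decreases over time while the diffusion term spreads clone sizes apart, which is the qualitative behaviour the two time-scale parameters control.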

Methods

  • gen_synthetic_data_Null(paras, noise_model, NreadsI, NreadsII, Nsamp): generate in-silico same-day RepSeq replicates.
  • generate_trajectories(tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI = '1e6', NreadsII = '1e6'): generate in-silico RepSeq samples taken t_ime apart.

Generator()
def gen_synthetic_data_Null(self, paras, noise_model, NreadsI, NreadsII, Nsamp):
    def gen_synthetic_data_Null(self, paras, noise_model, NreadsI, NreadsII, Nsamp):
        '''
        Outputs an array of observed clone frequencies and the corresponding dataframe of pair counts
        for a null model learned from a dataset pair with NreadsI and NreadsII number of reads, respectively.
        Crucial for RAM efficiency, sampling is conditioned on being observed in each of the three (n,0), (0,n'), and n,n'>0 conditions
        so that only Nsamp clones need to be sampled, rather than the N clones in the repertoire.
        Note that no explicit normalization is applied. It is assumed that the values in paras are consistent with N<f>=1
        (e.g. were obtained through the learning done in this package).
        '''

        alpha = paras[0] #power law exponent
        fmin=np.power(10,paras[-1])
        if noise_model<1:
            m_total=float(np.power(10, paras[3]))
            r_c1=NreadsI/m_total
            r_c2=NreadsII/m_total
            r_cvec=[r_c1,r_c2]
        if noise_model<2:
            beta_mv= paras[1]
            alpha_mv=paras[2]

        logrhofvec,logfvec = self.get_rhof(alpha,fmin)
        fvec=np.exp(logfvec)
        dlogf=np.diff(logfvec)/2.

        #generate measurement model distribution, Pn_f
        Pn_f=np.empty((len(logfvec),),dtype=object) #len(logfvec) samplers

        #get value at n=0 to use for conditioning on n>0 (and get full Pn_f here if noise_model=1,2)
        m_max=1e3 #conditioned on n=0, so no edge effects

        Nreadsvec=(NreadsI,NreadsII)
        for it in range(2):
            Pn_f=np.empty((len(fvec),),dtype=object)
            if noise_model==2:
                m1vec=Nreadsvec[it]*fvec
                for find,m1 in enumerate(m1vec):
                    Pn_f[find]=poisson(m1)
                logPn0_f=-m1vec
            elif noise_model==1:
                m1=Nreadsvec[it]*fvec
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                #distinct loop names avoid shadowing the n and p arrays
                for find,(n_f,p_f) in enumerate(zip(n,p)):
                    Pn_f[find]=nbinom(n_f,1-p_f)
                Pn0_f=np.asarray([Pn_find.pmf(0) for Pn_find in Pn_f])
                logPn0_f=np.log(Pn0_f)

            elif noise_model==0:
                m1=m_total*fvec
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pn0_f=np.zeros((len(fvec),))
                for find in range(len(Pn0_f)):
                    nbtmp=nbinom(n[find],1-p[find]).pmf(np.arange(m_max+1))
                    ptmp=poisson(r_cvec[it]*np.arange(m_max+1)).pmf(0)
                    Pn0_f[find]=np.sum(np.exp(np.log(nbtmp)+np.log(ptmp)))
                logPn0_f=np.log(Pn0_f)
            else:
                print('acq_model is 0,1,or 2 only')

            if it==0:
                Pn1_f=Pn_f
                logPn10_f=logPn0_f
            else:
                Pn2_f=Pn_f
                logPn20_f=logPn0_f

        #3-quadrant q|f conditional distribution (qx0:n1>0,n2=0;q0x:n1=0,n2>0;qxx:n1,n2>0)
        logPqx0_f=np.log(1-np.exp(logPn10_f))+logPn20_f
        logPq0x_f=logPn10_f+np.log(1-np.exp(logPn20_f))
        logPqxx_f=np.log(1-np.exp(logPn10_f))+np.log(1-np.exp(logPn20_f))
        #3-quadrant q,f joint distribution
        logPfqx0=logPqx0_f+logrhofvec
        logPfq0x=logPq0x_f+logrhofvec
        logPfqxx=logPqxx_f+logrhofvec
        #3-quadrant q marginal distribution
        Pqx0=np.trapz(np.exp(logPfqx0+logfvec),x=logfvec)
        Pq0x=np.trapz(np.exp(logPfq0x+logfvec),x=logfvec)
        Pqxx=np.trapz(np.exp(logPfqxx+logfvec),x=logfvec)

        #3-quadrant conditional f|q distribution
        Pf_qx0=np.where(Pqx0>0,np.exp(logPfqx0-np.log(Pqx0)),0)
        Pf_q0x=np.where(Pq0x>0,np.exp(logPfq0x-np.log(Pq0x)),0)
        Pf_qxx=np.where(Pqxx>0,np.exp(logPfqxx-np.log(Pqxx)),0)

        #normalize the 3-quadrant q marginal distribution
        newPqZ=Pqx0 + Pq0x + Pqxx
        Pqx0/=newPqZ
        Pq0x/=newPqZ
        Pqxx/=newPqZ

        Pfqx0=np.exp(logPfqx0)
        Pfq0x=np.exp(logPfq0x)
        Pfqxx=np.exp(logPfqxx)

        print('Model probs: '+str(Pqx0)+' '+str(Pq0x)+' '+str(Pqxx))

        #get samples
        num_samples=Nsamp
        q_samples=np.random.choice(range(3), num_samples, p=(Pqx0,Pq0x,Pqxx))
        #bincount keeps a zero entry for any quadrant that was never drawn (np.unique would drop it)
        counts=np.bincount(q_samples,minlength=3)
        num_qx0=counts[0]
        num_q0x=counts[1]
        num_qxx=counts[2]
        print('q samples: '+str(sum(counts))+' '+str(num_qx0)+' '+str(num_q0x)+' '+str(num_qxx))
        print('q sampled probs: '+str(num_qx0/float(sum(counts)))+' '+str(num_q0x/float(sum(counts)))+' '+str(num_qxx/float(sum(counts))))

        #x0
        integ=np.exp(np.log(Pf_qx0)+logfvec)
        f_samples_inds= self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qx0).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        qx0_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        qx0_samples=np.zeros((num_qx0,))
        if noise_model<1:
            qx0_m_samples=np.zeros((num_qx0,))
            #conditioning on n>0 applies an m-dependent factor to Pm_f, which can't be incorporated into the ppf method used for noise_model 1 and 2.
            #We handle that here by using a custom finite range sampler, which has the drawback of having to define an upper limit.
            #This works so long as n_max/r_c<<m_max, so depends on highest counts in data (n_max). My data had max counts of 1e3-1e4.
            #Alternatively, could define a custom scipy RV class by defining its PMF, but has to be array-compatible which requires care.
            m_samp_max=int(1e5)
            mvec=np.arange(m_samp_max)

        for it,find in enumerate(find_vals):
            if noise_model==0:
                m1=m_total*fvec[find]
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pm1_f=nbinom(n,1-p)

                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
                qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qx0_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn1_m1=poisson(r_c1*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
                    qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)

            elif noise_model>0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
                qx0_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
            else:
                print('acq_model is 0,1, or 2 only')
        qx0_pair_samples=np.hstack((qx0_samples[:,np.newaxis],np.zeros((num_qx0,1))))

        #0x
        integ=np.exp(np.log(Pf_q0x)+logfvec)
        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_q0x).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        q0x_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        q0x_samples=np.zeros((num_q0x,))
        if noise_model<1:
            q0x_m_samples=np.zeros((num_q0x,))
        for it,find in enumerate(find_vals):
            if noise_model==0:
                m2=m_total*fvec[find]
                v2=m2+beta_mv*np.power(m2,alpha_mv)
                p=1-m2/v2
                n=m2*m2/v2/p
                Pm2_f=nbinom(n,1-p)

                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
                q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(q0x_m_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn2_m2=poisson(r_c2*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
                    q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)

            elif noise_model > 0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
                q0x_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
            else:
                print('acq_model is 0,1,or 2 only')
        q0x_pair_samples=np.hstack((np.zeros((num_q0x,1)),q0x_samples[:,np.newaxis]))

        #qxx
        integ=np.exp(np.log(Pf_qxx)+logfvec)
        f_samples_inds=self._get_distsample(dlogf*(integ[1:] + integ[:-1]),num_qxx).flatten()
        f_sorted_inds=np.argsort(f_samples_inds)
        f_samples_inds=f_samples_inds[f_sorted_inds]
        qxx_f_samples=fvec[f_samples_inds]
        find_vals,f_start_ind,f_counts=np.unique(f_samples_inds,return_counts=True,return_index=True)
        qxx_n1_samples=np.zeros((num_qxx,))
        qxx_n2_samples=np.zeros((num_qxx,))
        if noise_model<1:
            qxx_m1_samples=np.zeros((num_qxx,))
            qxx_m2_samples=np.zeros((num_qxx,))
        for it,find in enumerate(find_vals):
            if noise_model==0:
                m1=m_total*fvec[find]
                v1=m1+beta_mv*np.power(m1,alpha_mv)
                p=1-m1/v1
                n=m1*m1/v1/p
                Pm1_f=nbinom(n,1-p)

                Pm1_f_adj=np.exp(np.log(1-np.exp(-r_c1*mvec))+np.log(Pm1_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c1+np.log(1-p))/(np.exp(r_c1)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                if np.sum(Pm1_f_adj)==0:
                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
                else:
                    Pm1_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm1_f_adj/np.sum(Pm1_f_adj)))
                    qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm1_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qxx_m1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn1_m1=poisson(r_c1*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn1_m1.cdf(0)) + Pn1_m1.cdf(0)
                    qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn1_m1.ppf(samples)

                m2=m_total*fvec[find]
                v2=m2+beta_mv*np.power(m2,alpha_mv)
                p=1-m2/v2
                n=m2*m2/v2/p
                Pm2_f=nbinom(n,1-p)

                Pm2_f_adj=np.exp(np.log(1-np.exp(-r_c2*mvec))+np.log(Pm2_f.pmf(mvec))-np.log((1-np.power(np.exp(r_c2+np.log(1-p))/(np.exp(r_c2)-p),n)))) #adds m-dependent factor due to conditioning on n>0...
                #check the second-sample distribution here (the original tested Pm1_f_adj, a copy-paste slip)
                if np.sum(Pm2_f_adj)==0:
                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=1
                else:
                    Pm2_f_adj_obj=rv_discrete(name='nbinom_adj',values=(mvec,Pm2_f_adj/np.sum(Pm2_f_adj)))
                    qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pm2_f_adj_obj.rvs(size=f_counts[it])

                mvals,minds,m_counts=np.unique(qxx_m2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]],return_inverse=True,return_counts=True)
                for mit,m in enumerate(mvals):
                    Pn2_m2=poisson(r_c2*m)
                    samples=np.random.random(size=m_counts[mit]) * (1-Pn2_m2.cdf(0)) + Pn2_m2.cdf(0)
                    qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]][minds==mit]=Pn2_m2.ppf(samples)

            elif noise_model>0:
                samples=np.random.random(size=f_counts[it]) * (1-Pn1_f[find].cdf(0)) + Pn1_f[find].cdf(0)
                qxx_n1_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn1_f[find].ppf(samples)
                samples=np.random.random(size=f_counts[it]) * (1-Pn2_f[find].cdf(0)) + Pn2_f[find].cdf(0)
                qxx_n2_samples[f_start_ind[it]:f_start_ind[it]+f_counts[it]]=Pn2_f[find].ppf(samples)
            else:
                print('acq_model is 0,1, or 2 only')

        qxx_pair_samples=np.hstack((qxx_n1_samples[:,np.newaxis],qxx_n2_samples[:,np.newaxis]))

        pair_samples=np.vstack((q0x_pair_samples,qx0_pair_samples,qxx_pair_samples))
        f_samples=np.concatenate((q0x_f_samples,qx0_f_samples,qxx_f_samples))
        #disabled branch: q0x_m1_samples, qx0_m1_samples, q0x_m2_samples and qx0_m2_samples are not defined above
        output_m_samples=False
        if noise_model<1 and output_m_samples:
            m1_samples=np.concatenate((q0x_m1_samples,qx0_m1_samples,qxx_m1_samples))
            m2_samples=np.concatenate((q0x_m2_samples,qx0_m2_samples,qxx_m2_samples))

        pair_samples_df=pd.DataFrame({'Clone_count_1':pair_samples[:,0],'Clone_count_2':pair_samples[:,1]})

        pair_samples_df['Clone_fraction_1'] = pair_samples_df['Clone_count_1']/np.sum(pair_samples_df['Clone_count_1'])
        pair_samples_df['Clone_fraction_2'] = pair_samples_df['Clone_count_2']/np.sum(pair_samples_df['Clone_count_2'])

        return f_samples,pair_samples_df
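
The negative binomial branches of the listing build the count distribution from a mean m and a variance v = m + beta_mv * m**alpha_mv. The short check below illustrates that this (n, 1-p) parametrisation gives scipy's nbinom exactly that mean and variance. The helper name negbin_from_mean and the numerical values (5.0, 0.7, 1.1) are chosen for illustration only.

```python
import numpy as np
from scipy.stats import nbinom

def negbin_from_mean(m, beta_mv, alpha_mv):
    """Frozen negative binomial with mean m and variance m + beta_mv * m**alpha_mv."""
    v = m + beta_mv * np.power(m, alpha_mv)
    p = 1 - m / v          # same intermediate quantity as in the listing
    n = m * m / v / p      # algebraically equal to m**2 / (v - m)
    return nbinom(n, 1 - p)

dist = negbin_from_mean(m=5.0, beta_mv=0.7, alpha_mv=1.1)
mean, var = dist.stats(moments='mv')   # should match m and v
```

Because beta_mv multiplies a power of the mean, the excess variance over Poisson sampling grows with clone abundance, and setting beta_mv to a very small value approaches the Poisson limit.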

Outputs an array of observed clone frequencies and the corresponding dataframe of pair counts for a null model learned from a dataset pair with NreadsI and NreadsII number of reads, respectively. Crucial for RAM efficiency, sampling is conditioned on being observed in each of the three (n,0), (0,n'), and n,n'>0 conditions so that only Nsamp clones need to be sampled, rather than the N clones in the repertoire. Note that no explicit normalization is applied. It is assumed that the values in paras are consistent with N<f>=1 (e.g. were obtained through the learning done in this package).
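
Conditioning on a clone being observed (n > 0) can be done by inverse-CDF sampling restricted to the part of the unit interval above P(n = 0), which is the pattern the source uses with cdf(0) and ppf. The sketch below isolates that step for a single Poisson distribution. The helper name sample_positive, the rate 0.3 and the use of numpy's Generator API are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

def sample_positive(dist, size, rng):
    """Draw from a frozen discrete distribution conditioned on n > 0."""
    p0 = dist.cdf(0)                        # P(n = 0)
    u = rng.random(size) * (1 - p0) + p0    # uniform on [p0, 1)
    return dist.ppf(u)                      # values of u above p0 map to n >= 1

rng = np.random.default_rng(1)
counts = sample_positive(poisson(0.3), size=1000, rng=rng)
```

Every draw is usable, whereas rejection sampling would discard about 74% of the draws at this rate, since P(n = 0) = exp(-0.3).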

def generate_trajectories(self, tau, theta, method, paras_1, paras_2, t_ime, filename, NreadsI='1e6', NreadsII='1e6'):

Generate in-silico RepSeq samples taken t_ime years apart.

Parameters
  • paras_1 (numpy array): parameters of the noise model learnt at time_1
  • paras_2 (numpy array): parameters of the noise model learnt at time_2
  • method (str): 'negative_binomial' or 'poisson'
  • tau (float): first time-scale parameter of the dynamics
  • theta (float): second time-scale parameter of the dynamics
  • t_ime (float): number of years between the two synthetic samplings (between time_1 and time_2)
  • filename (str): name of the file in which the dataframe is stored
Returns
  • data-frame - csv file: the output is a file named filename + '.csv', written with tab separators, with columns 'Clone_count_1' (at time_1), 'Clone_count_2' (at time_2) and the frequency counterparts 'Clone_frequency_1' and 'Clone_frequency_2'
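
Since the source writes the dataframe with to_csv(filename + '.csv', sep='\t'), the file has to be read back as tab-separated despite its extension. The round trip below is a sketch on a toy dataframe. The count values and the base name synthetic_pair are made up, and only the two count columns named in the docstring are used.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the dataframe written by generate_trajectories (values are made up).
toy = pd.DataFrame({'Clone_count_1': [3, 0, 12], 'Clone_count_2': [1, 4, 9]})

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'synthetic_pair')   # hypothetical base name
    toy.to_csv(filename + '.csv', sep='\t')          # same arguments as in the source
    # sep='\t' is required on reading; index_col=0 recovers the saved row index
    loaded = pd.read_csv(filename + '.csv', sep='\t', index_col=0)

# Number of clones seen at both time points
n_shared = int(((loaded['Clone_count_1'] > 0) & (loaded['Clone_count_2'] > 0)).sum())
```

Reading the file with the default comma separator would place each row in a single column, so passing sep='\t' explicitly is the safer habit.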