How to build and clean data
The following are detailed tutorials on how to build and clean data.
Use integrated response data
Note
If you use your own response data, ignore this section.
Use Data.DrData as follows:
DrData(pair_ls=...,
cell_ft=...,
drug_ft=...)
Section 1.1: set pair_ls
For pair_ls, you just need to set it to the return value of Data.DrRead.PairDef,
click here for details.
Section 1.2: set cell_ft
For cell_ft (str or dict), the following settings are available:
Parameter setting |
Corresponding file |
|---|---|
cell_ft=”EXP” |
GDSC_EXP.pkl |
cell_ft=”PES” |
GDSC_PES.pkl |
cell_ft=”MUT” |
GDSC_MUT.pkl |
cell_ft=”CNV” |
GDSC_CNV.pkl |
cell_ft=cell_dict |
None |
In the last row of the table, cell_dict could be CellFeat got by Data.DrRead.FeatCell, or your own dict, where the key is the cell name and the value is the feature vector, e.g. VAE_dict.pkl.
If you set cell_ft to "EXP", "PES", "MUT", "CNV",
pairs lacking cell data will be removed based on the corresponding integrated file when you use .clean.
If you set cell_ft to cell_dict,
pairs lacking cell data will be removed based on the dict you set when you use .clean.
Section 1.3: set drug_ft
For drug_ft (str or dict), the following settings are available:
Parameter setting |
Corresponding file |
|---|---|
drug_ft=”ECFP” |
SMILES_dict.pkl |
drug_ft=”SMILES” |
SMILES_dict.pkl |
drug_ft=”Graph” |
SMILES_dict.pkl |
drug_ft=”Image” |
SMILES_dict.pkl |
drug_ft=drug_dict |
None |
In the last row of the table, drug_dict could be your own dict, where the key is the drug name and the value is the feature vector, e.g. SMILESVec_dict.pkl.
If you set drug_ft to "ECFP", "SMILES", "Graph", "Image",
pairs lacking drug data will be removed based on the corresponding integrated file when you use .clean.
If you set drug_ft to drug_dict,
pairs lacking drug data will be removed based on the dict you set when you use .clean.
Use your own response data
Note
If you use integrated response data, ignore this section.
Use Data.DrData as follows:
DrData(pair_ls=...,
cell_ft=...,
drug_ft=...,
smiles_dict=...,
mpg_dict=...)
Section 2.1: set pair_ls
For pair_ls, you just need to set it to the return value of Data.DrRead.PairCSV,
click here for details.
Section 2.2: set cell_ft
The setting of this parameter is the same as that in section 1.2.
Section 2.3: set drug_ft, smiles_dict, mpg_dict
Use integrated drug feature
For drug_ft (str), "ECFP", "SMILES", "Graph", "Image" are available.
For smiles_dict (dict), it should be SMILES_dict got by Data.DrRead.FeatDrug.
For mpg_dict (dict or None, optional, default: None),
if you want to use MPG (frozen) as the drug encoder, it should be MPG_dict got by Data.DrRead.FeatDrug,
and if you want to use other drug encoders, you just need to use the default value.
Pairs lacking drug data will be removed based on the smiles_dict you set when you use .clean.
Use your own drug feature
For drug_ft (dict), it should be your own dict drug_dict, where the key is the drug name and the value is the feature vector, e.g. SMILESVec_dict.pkl.
For smiles_dict (dict or None, optional, default: None), you just need to use the default value.
For mpg_dict (dict or None, optional, default: None), you just need to use the default value.
Pairs lacking drug data will be removed based on the drug_ft you set when you use .clean.
Clean the response data
To clean response data, use .clean as follows, where data is the instantiated Data.DrData:
data.clean(cell_ft_ls=...)
For cell_ft_ls (list or None, optional, default: None), usually you just need to use the default value, which will remove pairs lacking cell or drug data based on the setting of cell_ft and drug_ft. For detailed removal rules, see Sections 1.2 and 1.3 or Sections 2.2 and 2.3 above.
If you want to build the benchmark, cell_ft_ls needs to be set as a list, and each element in the list has the same form as cell_ft, which will additionally remove pairs lacking cell data based on each element in the list. The detailed removal rules are the same as the cell_ft based removal rules in Section 1.2 or Section 2.2 above.