{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"0. Dependencies\n",
"1. Reproducing the Paper\n",
"2. Training the Model on Custom Datasets\n",
"3. Inference using Trained Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 0. Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Python Packages\n",
"The first step after cloning this repository is download and install the necessary python libraries/packages. Install the required packages by running the following cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -r requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 1. Reproducing the Paper"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Download Data\n",
"The next step is to download the data used to train/evaluate the models. Running the following command will download all 3 datasets, and convert their encodings so that they can be used by PeptideBERT."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./data/download_data.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 Train-Val-Test Split\n",
"Next, combine the positive and negative samples (downloaded by the cell above), shuffle them, and split them into three non-overlapping sets: train, validation, and test.\n",
"\n",
"To do so, run the following cell. It creates a sub-directory (inside the `data` directory) for each dataset and places the three subsets (train, validation, test) inside it.\n",
"\n",
"Additionally, to augment a dataset, edit the `./data/split_augment.py` file and call the `augment_data` function from the `main` function with the dataset you want to augment. For example, to augment the solubility dataset, add `augment_data('sol')` to the `main` function.\n",
"\n",
"To experiment with the augmentation techniques applied, edit the `augment_data` function: comment or uncomment the calls to the augmentation functions (such as `random_replace`, `random_delete`, etc.) and change the augmentation factor as desired. Keep in mind that each augmentation you apply must be followed by a call to the `combine` function. For example, to apply the `random_swap` augmentation with a `factor` of 0.2, add `new_inputs, new_labels = random_swap(inputs, labels, 0.2)` followed by `inputs, labels = combine(inputs, labels, new_inputs, new_labels)`, which merges the augmented samples into the original dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./data/split_augment.py"
]
},
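{
"cell_type": "markdown",
"metadata": {},
"source": [
"The combine, shuffle, and split step described in section 1.2 can be pictured with the minimal sketch below. It is not the code in `./data/split_augment.py`: the function name `split_dataset`, the 10% validation and test fractions, the fixed seed, and the use of plain Python lists rather than NumPy arrays are all assumptions made for illustration.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"\n",
"def split_dataset(pos, neg, val_frac=0.1, test_frac=0.1, seed=0):\n",
"    # Hypothetical helper: label positives 1 and negatives 0, then pool them.\n",
"    samples = [(seq, 1) for seq in pos] + [(seq, 0) for seq in neg]\n",
"    # Shuffle with a fixed seed so the split is reproducible.\n",
"    random.Random(seed).shuffle(samples)\n",
"    n_val = int(len(samples) * val_frac)\n",
"    n_test = int(len(samples) * test_frac)\n",
"    # Slice the shuffled list into three disjoint, contiguous ranges.\n",
"    val = samples[:n_val]\n",
"    test = samples[n_val:n_val + n_test]\n",
"    train = samples[n_val + n_test:]\n",
"    return train, val, test\n",
"\n",
"\n",
"# Toy token-id sequences standing in for encoded peptides.\n",
"pos = [[5, 6, 7], [6, 6, 6], [7, 8, 9], [5, 5, 5], [9, 9, 6]]\n",
"neg = [[10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20, 21], [22, 23, 24]]\n",
"train, val, test = split_dataset(pos, neg)\n",
"print(len(train), len(val), len(test))\n",
"```\n",
"\n",
"Slicing one shuffled list into contiguous ranges is what keeps the three subsets from overlapping."
]
},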
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 Model Config\n",
"Edit the `config.yaml` file and set the `task` parameter to one of `hemo` (for hemolysis dataset), `sol` (for solubility dataset), or `nf` (for non-fouling dataset) as desired.\n",
"\n",
"Additionally, If you want to tweak the model before training, you can do so by editing `./model/network.py` and `config.yaml` files. `./model/network.py` contains the actual architecture of the model as well as the optimizer and scheduler used to train the model. `config.yaml` contains all the hyperparameters used for training, as well as which dataset to train on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.4 Training\n",
"Now we are ready to train the model. Run the following cell to start training. This saves a checkpoint of the best model (as measured on the validation set) inside the `checkpoints` directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python train.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 2. Training the Model on Custom Datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Follow the cells below to train the model on custom datasets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Data Preparation\n",
"\n",
"create a `csv` file with the following format:\n",
"```csv\n",
"sequence,label\n",
"AAAAAAA,1\n",
"LLLLLLL,0\n",
"CCCCCCC,0\n",
"DDDDDDD,1\n",
"```\n",
"where `sequence` is the peptide sequence and `label` is the binary label (0 or 1). Save this file as `custom_data.csv` inside the `data` directory. Now, run the following cell (edit `task_name` as desired) to convert the `csv` file to the format required by PeptideBERT."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"task_name = 'REPLACE_WITH_TASK_NAME'\n",
"\n",
"# read data\n",
"seqs, labels = [], []\n",
"with open('./data/custom_data.csv', 'r') as f:\n",
" for line in f.readlines()[1:]:\n",
" seq, label = line.strip().split(',')\n",
" seqs.append(seq)\n",
" labels.append(int(label))\n",
"\n",
"MAX_LEN = max(map(len, seqs))\n",
"\n",
"# convert to tokens\n",
"mapping = dict(zip(\n",
" ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',\n",
" 'A','G','V','E','S','I','K','R','D','T','P','N',\n",
" 'Q','F','Y','M','H','C','W'],\n",
" range(30)\n",
"))\n",
"\n",
"pos_data, neg_data = [], []\n",
"for i in range(len(seqs)):\n",
" seq = [mapping[c] for c in seqs[i]] \n",
" seq.extend([0] * (MAX_LEN - len(seq))) # padding to max length\n",
" if labels[i] == 1:\n",
" pos_data.append(seq)\n",
" else:\n",
" neg_data.append(seq)\n",
"\n",
"pos_data = np.array(pos_data)\n",
"neg_data = np.array(neg_data)\n",
"\n",
"np.savez(\n",
" f'./data/{task_name}-positive.npz',\n",
" arr_0=pos_data\n",
")\n",
"np.savez(\n",
" f'./data/{task_name}-negative.npz',\n",
" arr_0=neg_data\n",
")"
]
},
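{
"cell_type": "markdown",
"metadata": {},
"source": [
"The conversion cell above looks every character up in `mapping`, so a sequence containing a character outside the 25-token vocabulary (for example `X`) raises a `KeyError`. One possible way to handle such input is sketched below. The names `VOCAB`, `TOKEN_TO_ID`, and `encode` are hypothetical, and mapping unknown characters to `[UNK]` is an assumption: whether the trained model treats `[UNK]` sensibly is not stated here, so cleaning the data beforehand may be the safer choice.\n",
"\n",
"```python\n",
"# Same token order as the `mapping` built in the conversion cell.\n",
"VOCAB = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'L',\n",
"         'A', 'G', 'V', 'E', 'S', 'I', 'K', 'R', 'D', 'T', 'P', 'N',\n",
"         'Q', 'F', 'Y', 'M', 'H', 'C', 'W']\n",
"TOKEN_TO_ID = {tok: idx for idx, tok in enumerate(VOCAB)}\n",
"\n",
"\n",
"def encode(seq, max_len):\n",
"    # Hypothetical helper: unknown characters fall back to the [UNK] id.\n",
"    unk = TOKEN_TO_ID['[UNK]']\n",
"    ids = [TOKEN_TO_ID.get(c, unk) for c in seq]\n",
"    # Right-pad with the [PAD] id up to max_len (no truncation is done).\n",
"    ids += [TOKEN_TO_ID['[PAD]']] * (max_len - len(ids))\n",
"    return ids\n",
"\n",
"\n",
"print(encode('LAX', 5))  # prints [5, 6, 1, 0, 0]\n",
"```"
]
},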
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Train-Val-Test Split\n",
"Next, combine the positive and negative samples, shuffle them, and split them into three non-overlapping sets: train, validation, and test.\n",
"\n",
"To do so, edit the `main` function inside the `./data/split_augment.py` file: comment out the existing calls to `split_data` and add the line `split_data('REPLACE_WITH_TASK_NAME')`. Then run the following cell. It creates a sub-directory (inside the `data` directory) for the custom dataset and places the three subsets (train, validation, test) inside it.\n",
"\n",
"Additionally, to augment the dataset, edit the `./data/split_augment.py` file and call the `augment_data` function from the `main` function like so: `augment_data('REPLACE_WITH_TASK_NAME')`.\n",
"\n",
"To experiment with the augmentation techniques applied, edit the `augment_data` function: comment or uncomment the calls to the augmentation functions (such as `random_replace`, `random_delete`, etc.) and change the augmentation factor as desired. Keep in mind that each augmentation you apply must be followed by a call to the `combine` function. For example, to apply the `random_swap` augmentation with a `factor` of 0.2, add `new_inputs, new_labels = random_swap(inputs, labels, 0.2)` followed by `inputs, labels = combine(inputs, labels, new_inputs, new_labels)`, which merges the augmented samples into the original dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./data/split_augment.py"
]
},
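{
"cell_type": "markdown",
"metadata": {},
"source": [
"The augment-then-combine pattern from section 2.2 can be sketched as follows. These are illustrative stand-ins, not the functions defined in `./data/split_augment.py`: only the call signatures `random_swap(inputs, labels, factor)` and `combine(inputs, labels, new_inputs, new_labels)` come from the text above. The bodies, the `seed` argument, the reading of `factor` as the fraction of samples to augment, and the use of plain Python lists are assumptions.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"\n",
"def random_swap(inputs, labels, factor, seed=0):\n",
"    # Hypothetical stand-in: augment a `factor` fraction of the samples\n",
"    # by swapping the tokens at two randomly chosen positions.\n",
"    rng = random.Random(seed)\n",
"    n_new = int(len(inputs) * factor)\n",
"    chosen = rng.sample(range(len(inputs)), n_new)\n",
"    new_inputs, new_labels = [], []\n",
"    for idx in chosen:\n",
"        seq = list(inputs[idx])  # copy so the original sample is untouched\n",
"        i, j = rng.sample(range(len(seq)), 2)\n",
"        seq[i], seq[j] = seq[j], seq[i]\n",
"        new_inputs.append(seq)\n",
"        new_labels.append(labels[idx])  # the augmented copy keeps its label\n",
"    return new_inputs, new_labels\n",
"\n",
"\n",
"def combine(inputs, labels, new_inputs, new_labels):\n",
"    # Hypothetical stand-in: append the augmented samples to the originals.\n",
"    return inputs + new_inputs, labels + new_labels\n",
"\n",
"\n",
"# Toy token-id sequences standing in for encoded peptides.\n",
"inputs = [[5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16],\n",
"          [17, 18, 19, 20], [21, 22, 23, 24]]\n",
"labels = [1, 0, 1, 0, 1]\n",
"\n",
"new_inputs, new_labels = random_swap(inputs, labels, 0.2)\n",
"inputs, labels = combine(inputs, labels, new_inputs, new_labels)\n",
"print(len(inputs), len(labels))\n",
"```\n",
"\n",
"Because each augmentation returns only the new samples, skipping the `combine` call would leave them out of the dataset, which is why every augmentation has to be followed by one."
]
},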
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 Model Config\n",
"Edit the `config.yaml` file and set the `task` parameter to `REPLACE_WITH_TASK_NAME`.\n",
"\n",
"Additionally, If you want to tweak the model before training, you can do so by editing `./model/network.py` and `config.yaml` files. `./model/network.py` contains the actual architecture of the model as well as the optimizer and scheduler used to train the model. `config.yaml` contains all the hyperparameters used for training, as well as which dataset to train on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 Training\n",
"Now we are ready to train the model. Run the following cell to start training. This saves a checkpoint of the best model (as measured on the validation set) inside the `checkpoints` directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python train.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 3. Inference using Trained Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Load Trained Model\n",
"Load the trained model by running the following cell. Edit the `run_name` parameter to the name of the directory containing the trained model (inside the `checkpoints` directory)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import yaml\n",
"from model.network import create_model\n",
"\n",
"run_name = 'REPLACE_WITH_RUN_NAME'\n",
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
"\n",
"\n",
"config = yaml.load(open('./config.yaml', 'r'), Loader=yaml.FullLoader)\n",
"config['device'] = device\n",
"\n",
"model = create_model(config)\n",
"model.load_state_dict(torch.load(f'./checkpoints/{run_name}/model.pt')['model_state_dict'], strict=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Input Data\n",
"Create a text file containing peptide sequences in the following format:\n",
"```txt\n",
"AAAAAAA\n",
"LLLLLLL\n",
"CCCCCCC\n",
"DDDDDDD\n",
"```\n",
"where each line represents a peptide sequence. Save this file as `input.txt` inside the `data` directory and run the following cell. The corresponding predictions will be saved in `output.txt` file inside the `data` directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"seqs = []\n",
"with open('./data/input.txt', 'r') as f:\n",
" for line in f.readlines():\n",
" seq = line.strip()\n",
" seqs.append(seq)\n",
"\n",
"MAX_LEN = max(map(len, seqs))\n",
"\n",
"# convert to tokens\n",
"mapping = dict(zip(\n",
" ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',\n",
" 'A','G','V','E','S','I','K','R','D','T','P','N',\n",
" 'Q','F','Y','M','H','C','W'],\n",
" range(30)\n",
"))\n",
"\n",
"for i in range(len(seqs)):\n",
" seqs[i] = [mapping[c] for c in seqs[i]] \n",
" seqs[i].extend([0] * (MAX_LEN - len(seqs[i]))) # padding to max length\n",
"\n",
"preds = []\n",
"with torch.inference_mode():\n",
" for i in range(len(seqs)):\n",
" input_ids = torch.tensor([seqs[i]]).to(device)\n",
" attention_mask = (input_ids != 0).float()\n",
" output = int(model(input_ids, attention_mask)[0] > 0.5)\n",
" print(output)\n",
" preds.append(output)\n",
"\n",
"with open('./data/output.txt', 'w') as f:\n",
" for pred in preds:\n",
" f.write(str(pred) + '\\n')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}