{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "\n", "0. Dependencies\n", "1. Reproducing the Paper\n", "2. Training the Model on Custom Datasets\n", "3. Inference using Trained Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 0. Dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python Packages\n", "The first step after cloning this repository is to download and install the necessary Python packages. Install them by running the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -r requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Reproducing the Paper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Download Data\n", "The next step is to download the data used to train and evaluate the models. Running the following command downloads all 3 datasets and converts their encodings so that they can be used by PeptideBERT." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python ./data/download_data.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Train-Val-Test Split\n", "Now, we want to combine the positive and negative samples (downloaded by the cell above), shuffle them, and split them into 3 non-overlapping sets: train, validation, and test.\n", "\n", "To do so, run the following cell. It creates a sub-directory (inside the `data` directory) for each dataset and places the subsets (train, validation, test) inside it.\n", "\n", "Additionally, if you want to augment any dataset, you can do so by editing the `./data/split_augment.py` file. You can call the `augment_data` function from the `main` function with the name of the dataset that you want to augment. 
For example, to augment the solubility dataset, add `augment_data('sol')` to the `main` function.\n", "\n", "Further, to experiment with the augmentation techniques applied, edit the `augment_data` function: comment or uncomment the calls to the augmentation functions (such as `random_replace`, `random_delete`, etc.) and change the augmentation factor as desired. Keep in mind that each augmentation you apply must be followed by a call to the `combine` function. For example, to apply the `random_swap` augmentation with a `factor` of 0.2, add `new_inputs, new_labels = random_swap(inputs, labels, 0.2)` followed by `inputs, labels = combine(inputs, labels, new_inputs, new_labels)` to merge the augmented samples into the original dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python ./data/split_augment.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Model Config\n", "Edit the `config.yaml` file and set the `task` parameter to one of `hemo` (for the hemolysis dataset), `sol` (for the solubility dataset), or `nf` (for the non-fouling dataset) as desired.\n", "\n", "Additionally, if you want to tweak the model before training, you can do so by editing the `./model/network.py` and `config.yaml` files. `./model/network.py` contains the architecture of the model as well as the optimizer and scheduler used to train it. `config.yaml` contains all the hyperparameters used for training, as well as the dataset to train on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4 Training\n", "Now we are ready to train our model. Run the following cell to start the training procedure. 
This will save a checkpoint of the best model (on the validation set) inside the `checkpoints` directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python train.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Training the Model on Custom Datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Follow the cells below to train the model on custom datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Data Preparation\n", "\n", "Create a `csv` file with the following format:\n", "```csv\n", "sequence,label\n", "AAAAAAA,1\n", "LLLLLLL,0\n", "CCCCCCC,0\n", "DDDDDDD,1\n", "```\n", "where `sequence` is the peptide sequence and `label` is the binary label (0 or 1). Save this file as `custom_data.csv` inside the `data` directory. Then run the following cell (edit `task_name` as desired) to convert the `csv` file to the format required by PeptideBERT." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "task_name = 'REPLACE_WITH_TASK_NAME'\n", "\n", "# read data (skip the header row and any blank lines)\n", "seqs, labels = [], []\n", "with open('./data/custom_data.csv', 'r') as f:\n", "    for line in f.readlines()[1:]:\n", "        if not line.strip():\n", "            continue\n", "        seq, label = line.strip().split(',')\n", "        seqs.append(seq)\n", "        labels.append(int(label))\n", "\n", "MAX_LEN = max(map(len, seqs))\n", "\n", "# convert to tokens\n", "mapping = dict(zip(\n", "    ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',\n", "    'A','G','V','E','S','I','K','R','D','T','P','N',\n", "    'Q','F','Y','M','H','C','W'],\n", "    range(30)\n", "))\n", "\n", "# separate the tokenized sequences by label\n", "pos_data, neg_data = [], []\n", "for i in range(len(seqs)):\n", "    seq = [mapping[c] for c in seqs[i]]\n", "    seq.extend([0] * (MAX_LEN - len(seq)))  # padding to max length\n", "    if labels[i] == 1:\n", "        pos_data.append(seq)\n", "    else:\n", "        neg_data.append(seq)\n", "\n", "pos_data = np.array(pos_data)\n", "neg_data = np.array(neg_data)\n", 
"\n", "np.savez(\n", " f'./data/{task_name}-positive.npz',\n", " arr_0=pos_data\n", ")\n", "np.savez(\n", " f'./data/{task_name}-negative.npz',\n", " arr_0=neg_data\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Train-Val-Test Split\n", "Now, we want to combine the positive and negative samples, shuffle them and split them into 3 non-overlapping sets - train, validation, and test.\n", "\n", "To do so, edit the `main` function inside `./data/split_augment.py` file (comment existing calls to `split_data` and add the line `split_data('REPLACE_WITH_TASK_NAME')`) and run the following cell, this will create sub-directories (inside the `data` directory) for the custom dataset and place the subsets (train, validation, test) inside it.\n", "\n", "Additionally, if you want to augment the dataset, you can do so by editing `./data/split_augment.py` file. You can call the `augment_data` function from the `main` function like so: `augment_data('REPLACE_WITH_TASK_NAME')`.\n", "\n", "Further, to change/experiment with the augmentation techniques applied, you can edit the `augment_data` function. Comment/uncomment the call to any of the augmentation functions (such as `random_replace`, `random_delete`, etc.) as desired, change the factor for augmentation as desired. Do keep in mind that for each augmentation applied, you have to call the `combine` function. For example, if you want to apply the `random_swap` augmentation with a `factor` of 0.2, you can add `new_inputs, new_labels = random_swap(inputs, labels, 0.2)` followed by `inputs, labels = combine(inputs, labels, new_inputs, new_labels)` to merge the augmented dataset into the original dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python ./data/split_augment.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Model Config\n", "Edit the `config.yaml` file and set the `task` parameter to `REPLACE_WITH_TASK_NAME`.\n", "\n", "Additionally, if you want to tweak the model before training, you can do so by editing the `./model/network.py` and `config.yaml` files. `./model/network.py` contains the architecture of the model as well as the optimizer and scheduler used to train it. `config.yaml` contains all the hyperparameters used for training, as well as the dataset to train on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Training\n", "Now we are ready to train our model. Run the following cell to start the training procedure. This will save a checkpoint of the best model (on the validation set) inside the `checkpoints` directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python train.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Inference using Trained Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Load Trained Model\n", "Load the trained model by running the following cell. Set the `run_name` variable to the name of the directory containing the trained model (inside the `checkpoints` directory)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import yaml\n", "from model.network import create_model\n", "\n", "run_name = 'REPLACE_WITH_RUN_NAME'\n", "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", "\n", "# load the training config and point it at the available device\n", "with open('./config.yaml', 'r') as f:\n", "    config = yaml.load(f, Loader=yaml.FullLoader)\n", "config['device'] = device\n", "\n", "model = create_model(config)\n", "# map_location lets a checkpoint saved on a GPU load on a CPU-only machine\n", "checkpoint = torch.load(f'./checkpoints/{run_name}/model.pt', map_location=device)\n", "model.load_state_dict(checkpoint['model_state_dict'], strict=False)\n", "model.eval()  # disable dropout for inference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Input Data\n", "Create a text file containing peptide sequences in the following format:\n", "```txt\n", "AAAAAAA\n", "LLLLLLL\n", "CCCCCCC\n", "DDDDDDD\n", "```\n", "where each line represents a peptide sequence. Save this file as `input.txt` inside the `data` directory and run the following cell. The corresponding predictions will be saved, one per line, in the `output.txt` file inside the `data` directory." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read sequences, skipping blank lines (e.g. a trailing newline)\n", "seqs = []\n", "with open('./data/input.txt', 'r') as f:\n", "    for line in f.readlines():\n", "        seq = line.strip()\n", "        if seq:\n", "            seqs.append(seq)\n", "\n", "MAX_LEN = max(map(len, seqs))\n", "\n", "# convert to tokens\n", "mapping = dict(zip(\n", "    ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',\n", "    'A','G','V','E','S','I','K','R','D','T','P','N',\n", "    'Q','F','Y','M','H','C','W'],\n", "    range(30)\n", "))\n", "\n", "for i in range(len(seqs)):\n", "    seqs[i] = [mapping[c] for c in seqs[i]]\n", "    seqs[i].extend([0] * (MAX_LEN - len(seqs[i])))  # padding to max length\n", "\n", "# predict one sequence at a time; outputs above 0.5 are labeled 1\n", "preds = []\n", "with torch.inference_mode():\n", "    for i in range(len(seqs)):\n", "        input_ids = torch.tensor([seqs[i]]).to(device)\n", "        attention_mask = (input_ids != 0).float()\n", "        output = int(model(input_ids, attention_mask)[0] > 0.5)\n", "        print(output)\n", "        preds.append(output)\n", "\n", "with open('./data/output.txt', 'w') as f:\n", "    for pred in preds:\n", "        f.write(str(pred) + '\\n')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }