{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "0. <a href=\"#sec0\">Dependencies</a>\n",
    "1. <a href=\"#sec1\">Reproducing the Paper</a>\n",
    "2. <a href=\"#sec2\">Training the Model on Custom Datasets</a>\n",
    "3. <a href=\"#sec3\">Inference using Trained Model</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"sec0\"></a>\n",
    "## 0. Dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Python Packages\n",
    "The first step after cloning this repository is download and install the necessary python libraries/packages. Install the required packages by running the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"sec1\"></a>\n",
    "## 1. Reproducing the Paper"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 Download Data\n",
    "The next step is to download the data used to train/evaluate the models. Running the following command will download all 3 datasets, and convert their encodings so that they can be used by PeptideBERT."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ./data/download_data.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Train-Val-Test Split\n",
    "Now, we want to combine the positive and negative samples (downloaded by the above cell), shuffle them and split them into 3 non-overlapping sets - train, validation, and test.\n",
    "\n",
    "To do so, run the following cell, this will create sub-directories (inside the `data` directory) for each dataset and place the subsets (train, validation, test) inside it.\n",
    "\n",
    "Additionally, if you want to augment any dataset, you can do so by editing `./data/split_augment.py` file. You can call the `augment_data` function from the `main` function with the dataset that you want to augment. For example, if you want to augment the `solubility` dataset, you can add `augment_data('sol')` to the `main` function.\n",
    "\n",
    "Further, to change/experiment with the augmentation techniques applied, you can edit the `augment_data` function. Comment/uncomment the call to any of the augmentation functions (such as `random_replace`, `random_delete`, etc.) as desired, change the factor for augmentation as desired. Do keep in mind that for each augmentation applied, you have to call the `combine` function. For example, if you want to apply the `random_swap` augmentation with a `factor` of 0.2, you can add `new_inputs, new_labels = random_swap(inputs, labels, 0.2)` followed by `inputs, labels = combine(inputs, labels, new_inputs, new_labels)` to merge the augmented dataset into the original dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ./data/split_augment.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3 Model Config\n",
    "Edit the `config.yaml` file and set the `task` parameter to one of `hemo` (for hemolysis dataset), `sol` (for solubility dataset), or `nf` (for non-fouling dataset) as desired.\n",
    "\n",
    "Additionally, If you want to tweak the model before training, you can do so by editing `./model/network.py` and `config.yaml` files. `./model/network.py` contains the actual architecture of the model as well as the optimizer and scheduler used to train the model. `config.yaml` contains all the hyperparameters used for training, as well as which dataset to train on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.4 Training\n",
    "Now we are ready to train our model. Run the following cell to start the training procedure. This will save a checkpoint of the best model (on validation set) inside the `checkpoints` directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python train.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"sec2\"></a>\n",
    "## 2. Training the Model on Custom Datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Follow the cells below to train the model on custom datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Data Preparation\n",
    "\n",
    "create a `csv` file with the following format:\n",
    "```csv\n",
    "sequence,label\n",
    "AAAAAAA,1\n",
    "LLLLLLL,0\n",
    "CCCCCCC,0\n",
    "DDDDDDD,1\n",
    "```\n",
    "where `sequence` is the peptide sequence and `label` is the binary label (0 or 1). Save this file as `custom_data.csv` inside the `data` directory. Now, run the following cell (edit `task_name` as desired) to convert the `csv` file to the format required by PeptideBERT."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "task_name = 'REPLACE_WITH_TASK_NAME'\n",
    "\n",
    "# read data\n",
    "seqs, labels = [], []\n",
    "with open('./data/custom_data.csv', 'r') as f:\n",
    "    for line in f.readlines()[1:]:\n",
    "        seq, label = line.strip().split(',')\n",
    "        seqs.append(seq)\n",
    "        labels.append(int(label))\n",
    "\n",
    "MAX_LEN = max(map(len, seqs))\n",
    "\n",
    "# convert to tokens\n",
    "mapping = dict(zip(\n",
    "    ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',\n",
    "    'A','G','V','E','S','I','K','R','D','T','P','N',\n",
    "    'Q','F','Y','M','H','C','W'],\n",
    "    range(30)\n",
    "))\n",
    "\n",
    "pos_data, neg_data = [], []\n",
    "for i in range(len(seqs)):\n",
    "    seq = [mapping[c] for c in seqs[i]] \n",
    "    seq.extend([0] * (MAX_LEN - len(seq)))  # padding to max length\n",
    "    if labels[i] == 1:\n",
    "        pos_data.append(seq)\n",
    "    else:\n",
    "        neg_data.append(seq)\n",
    "\n",
    "pos_data = np.array(pos_data)\n",
    "neg_data = np.array(neg_data)\n",
    "\n",
    "np.savez(\n",
    "    f'./data/{task_name}-positive.npz',\n",
    "    arr_0=pos_data\n",
    ")\n",
    "np.savez(\n",
    "    f'./data/{task_name}-negative.npz',\n",
    "    arr_0=neg_data\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Train-Val-Test Split\n",
    "Now, we want to combine the positive and negative samples, shuffle them and split them into 3 non-overlapping sets - train, validation, and test.\n",
    "\n",
    "To do so, edit the `main` function inside `./data/split_augment.py` file (comment existing calls to `split_data` and add the line `split_data('REPLACE_WITH_TASK_NAME')`) and run the following cell, this will create sub-directories (inside the `data` directory) for the custom dataset and place the subsets (train, validation, test) inside it.\n",
    "\n",
    "Additionally, if you want to augment the dataset, you can do so by editing `./data/split_augment.py` file. You can call the `augment_data` function from the `main` function like so: `augment_data('REPLACE_WITH_TASK_NAME')`.\n",
    "\n",
    "Further, to change/experiment with the augmentation techniques applied, you can edit the `augment_data` function. Comment/uncomment the call to any of the augmentation functions (such as `random_replace`, `random_delete`, etc.) as desired, change the factor for augmentation as desired. Do keep in mind that for each augmentation applied, you have to call the `combine` function. For example, if you want to apply the `random_swap` augmentation with a `factor` of 0.2, you can add `new_inputs, new_labels = random_swap(inputs, labels, 0.2)` followed by `inputs, labels = combine(inputs, labels, new_inputs, new_labels)` to merge the augmented dataset into the original dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ./data/split_augment.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Model Config\n",
    "Edit the `config.yaml` file and set the `task` parameter to `REPLACE_WITH_TASK_NAME`.\n",
    "\n",
    "Additionally, If you want to tweak the model before training, you can do so by editing `./model/network.py` and `config.yaml` files. `./model/network.py` contains the actual architecture of the model as well as the optimizer and scheduler used to train the model. `config.yaml` contains all the hyperparameters used for training, as well as which dataset to train on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Training\n",
    "Now we are ready to train our model. Run the following cell to start the training procedure. This will save a checkpoint of the best model (on validation set) inside the `checkpoints` directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python train.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"sec3\"></a>\n",
    "## 3. Inference using Trained Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Load Trained Model\n",
    "Load the trained model by running the following cell. Edit the `run_name` parameter to the name of the directory containing the trained model (inside the `checkpoints` directory)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import yaml\n",
    "from model.network import create_model\n",
    "\n",
    "run_name = 'REPLACE_WITH_RUN_NAME'\n",
    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "\n",
    "\n",
    "config = yaml.load(open('./config.yaml', 'r'), Loader=yaml.FullLoader)\n",
    "config['device'] = device\n",
    "\n",
    "model = create_model(config)\n",
    "model.load_state_dict(torch.load(f'./checkpoints/{run_name}/model.pt')['model_state_dict'], strict=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Input Data\n",
    "Create a text file containing peptide sequences in the following format:\n",
    "```txt\n",
    "AAAAAAA\n",
    "LLLLLLL\n",
    "CCCCCCC\n",
    "DDDDDDD\n",
    "```\n",
    "where each line represents a peptide sequence. Save this file as `input.txt` inside the `data` directory and run the following cell. The corresponding predictions will be saved in `output.txt` file inside the `data` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "seqs = []\n",
    "with open('./data/input.txt', 'r') as f:\n",
    "    for line in f.readlines():\n",
    "        seq = line.strip()\n",
    "        seqs.append(seq)\n",
    "\n",
    "MAX_LEN = max(map(len, seqs))\n",
    "\n",
    "# convert to tokens\n",
    "mapping = dict(zip(\n",
    "    ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',\n",
    "    'A','G','V','E','S','I','K','R','D','T','P','N',\n",
    "    'Q','F','Y','M','H','C','W'],\n",
    "    range(30)\n",
    "))\n",
    "\n",
    "for i in range(len(seqs)):\n",
    "    seqs[i] = [mapping[c] for c in seqs[i]] \n",
    "    seqs[i].extend([0] * (MAX_LEN - len(seqs[i])))  # padding to max length\n",
    "\n",
    "preds = []\n",
    "with torch.inference_mode():\n",
    "    for i in range(len(seqs)):\n",
    "        input_ids = torch.tensor([seqs[i]]).to(device)\n",
    "        attention_mask = (input_ids != 0).float()\n",
    "        output = int(model(input_ids, attention_mask)[0] > 0.5)\n",
    "        print(output)\n",
    "        preds.append(output)\n",
    "\n",
    "with open('./data/output.txt', 'w') as f:\n",
    "    for pred in preds:\n",
    "        f.write(str(pred) + '\\n')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}