# Generating Scores using CircRNA Random Forest Classifier

We provide a Python script, `random_forest_train_test.py`, that applies a pre-trained Random Forest classifier (trained on a human brain sample) to circular RNA (circRNA) datasets and generates a score for each circRNA.

## Requirements

This script requires Python 3 and the following Python libraries:

- NumPy
- pandas
- scikit-learn
- joblib

### Installation with pip

You can install the required packages using `pip` if you are using a standard Python environment:

```bash
pip install numpy pandas scikit-learn joblib
```

### Installation with conda

If you are using a conda environment (recommended for managing dependencies), you can install the required packages using `conda`:

```bash
conda install numpy pandas scikit-learn joblib
```

## Testing

To test the pre-trained model, use the `test` command with the following syntax:

```bash
python3 random_forest_train_test.py test -m brain.pre_trained.joblib -i feature_file -o output_folder/prefix
```

* Replace `feature_file` with the feature file generated by TERRACE using the `-fe` command. See the `Scoring` section in the original README of TERRACE for more details.
* Replace `output_folder/prefix` with the path to your desired output folder and the prefix for the output files.

This command will:

1. Load the specified pre-trained model.
2. Apply the model to the provided feature file.
3. Write the prediction probabilities (scores) to files following the pattern `output_folder/prefix.feature_file.prob.csv`, listing each circRNA_id followed by its probability (score) in the range 0 to 1.
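The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the contents of `random_forest_train_test.py`: the `score_features` helper, the column names (`circRNA_id`, `f1`, `f2`), the `score` header, and the toy model that stands in for `brain.pre_trained.joblib` are all hypothetical. The real feature file layout is defined by TERRACE.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def score_features(model_path, features):
    """Load a saved classifier and return one positive-class probability per row.

    `features` is a DataFrame with a `circRNA_id` column plus numeric
    feature columns (hypothetical layout, for illustration only).
    """
    model = joblib.load(model_path)
    X = features.drop(columns=["circRNA_id"])
    # predict_proba returns one column per class, ordered as in model.classes_;
    # pick the column for class 1 (assumed here to mean "true circRNA").
    positive = list(model.classes_).index(1)
    scores = model.predict_proba(X)[:, positive]
    return pd.DataFrame({"circRNA_id": features["circRNA_id"], "score": scores})


# Toy stand-in for the pre-trained model: fit on random data and save it.
rng = np.random.default_rng(0)
train_X = pd.DataFrame(rng.random((40, 2)), columns=["f1", "f2"])
train_y = (train_X["f1"] > 0.5).astype(int)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(train_X, train_y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "toy.joblib")
    joblib.dump(toy_model, model_path)

    features = pd.DataFrame(
        {"circRNA_id": ["c1", "c2", "c3"], "f1": [0.1, 0.9, 0.5], "f2": [0.3, 0.2, 0.8]}
    )
    result = score_features(model_path, features)
    # Write the scores next to the toy model, mimicking the prob.csv output.
    result.to_csv(os.path.join(tmp, "toy.prob.csv"), index=False)

print(result)
```

Each score is the fraction of the forest's vote for the positive class, which is why it always falls between 0 and 1.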

# Integrating the Scores

Use the provided script `integrate.py` to embed the Random Forest scores from `prefix.feature_file.prob.csv` into the `output.gtf` file generated by TERRACE. The usage is:

```bash
python3 integrate.py <output.gtf> <prefix.feature_file.prob.csv> <output-with-score.gtf>
```

`output.gtf` is the original GTF file produced by TERRACE with abundance values in the `score` field.

`prefix.feature_file.prob.csv` is the list of (circRNA_id, score) pairs generated by the Random Forest testing command.

`output-with-score.gtf` is a modified GTF file with the same list of circRNAs as `output.gtf`, but with the `score` field changed to hold the Random Forest score instead of the abundance.
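To make the transformation concrete, here is a minimal sketch of replacing the GTF `score` field (the sixth tab-separated column) with a value looked up by ID. It is not the code of `integrate.py`. It assumes, purely for illustration, that the circRNA_id in the CSV matches the `transcript_id` attribute of the GTF record, that the CSV has no header row, and that records without a matching score are left unchanged. The `integrate_scores` helper and the toy records are hypothetical.

```python
import csv
import io
import re

# Assumption: the circRNA_id is stored in the transcript_id attribute.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def integrate_scores(gtf_lines, scores):
    """Yield GTF lines with column 6 (score) replaced by a value from `scores`."""
    for line in gtf_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Comment lines and malformed records do not have 9 columns; pass them through.
        match = TRANSCRIPT_ID.search(fields[8]) if len(fields) == 9 else None
        if match and match.group(1) in scores:
            fields[5] = f"{scores[match.group(1)]:.4f}"
            line = "\t".join(fields)
        yield line


# Toy inputs standing in for prefix.feature_file.prob.csv and output.gtf.
prob_csv = io.StringIO("circ1,0.93\ncirc2,0.12\n")
scores = {row[0]: float(row[1]) for row in csv.reader(prob_csv)}

gtf = [
    'chr1\tTERRACE\ttranscript\t100\t500\t42\t+\t.\tgene_id "g1"; transcript_id "circ1";',
    'chr1\tTERRACE\ttranscript\t900\t1500\t7\t-\t.\tgene_id "g2"; transcript_id "circ2";',
]
updated = list(integrate_scores(gtf, scores))
print("\n".join(updated))
```

Only the sixth column changes; coordinates, strand, and attributes are carried over untouched, so the output lists the same circRNAs as the input.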

# Generating Precise CircRNAs

Use the provided script `precise.py` to filter circRNAs by score. It takes `output-with-score.gtf` and a threshold value (between 0 and 1) as input and writes a file, `precise.gtf`, containing the circRNAs with scores above the threshold. The usage is:

```bash
python3 precise.py <output-with-score.gtf> <precise.gtf> <threshold>
```

`output-with-score.gtf` is the GTF file with Random Forest scores integrated. 

`precise.gtf` is the output GTF file containing a precise list of circRNAs with scores above the given threshold.

`threshold` is a float value between 0 and 1. CircRNAs with scores below this value are discarded.
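The filtering step can be sketched as follows. This is an illustration rather than the code of `precise.py`: the `filter_by_score` helper and the toy records are hypothetical, and the sketch assumes that every record carries its own score in the sixth column and that a score exactly equal to the threshold is kept.

```python
def filter_by_score(gtf_lines, threshold):
    """Return the GTF records whose score field (column 6) is >= threshold."""
    kept = []
    for line in gtf_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Skip comments, malformed records, and records without a numeric score.
        if line.startswith("#") or len(fields) != 9 or fields[5] == ".":
            continue
        if float(fields[5]) >= threshold:
            kept.append(line)
    return kept


# Toy input standing in for output-with-score.gtf.
gtf = [
    "# toy input, not real TERRACE output",
    'chr1\tTERRACE\ttranscript\t100\t500\t0.9300\t+\t.\ttranscript_id "circ1";',
    'chr1\tTERRACE\ttranscript\t900\t1500\t0.1200\t-\t.\ttranscript_id "circ2";',
]
precise = filter_by_score(gtf, 0.8)
print("\n".join(precise))
```

A higher threshold keeps fewer circRNAs with higher scores, and a lower one keeps more of the original list.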

# Example

The example directory contains the TERRACE output file `example-output.gtf` and feature file `feature_file` generated following the commands in the `Scoring` section of the original README of TERRACE. 

The following commands, run inside the `example` directory, generate scores using the pre-trained model, integrate the scores, and generate `precise.gtf`.

First, move to the example directory:

```bash
cd ./example
```

```bash
python3 ../random_forest_train_test.py test -m ../brain.pre_trained.joblib -i feature_file -o ./example
```

This creates an intermediate file, `example.feature_file.prob.csv`, containing the probabilities (scores).

```bash
python3 ../integrate.py example-output.gtf example.feature_file.prob.csv example-output-with-score.gtf
```

This creates a GTF file with the integrated scores, `example-output-with-score.gtf`.

```bash
python3 ../precise.py example-output-with-score.gtf precise.gtf 0.8
```

This creates the precise GTF file, `precise.gtf`, containing the circRNAs with scores above 0.8.

