*** Quick start

There are a number of options and parameters available, but we suggest first running the following steps to verify that all the components work properly. Here, we will create a structure for chromosome 14 using our basic pipeline (MMC) and open a web-based visualizer to study the results. The commands can simply be copied and pasted into the terminal.

0. Download the original ChIA-PET data used in the study (http://nucleus3d.cent.uw.edu.pl/3dgnome/data.tar.gz) and save it to the data/ directory.

1. Use make to compile the MMC component:
	$ make

2. Start the simulation:
	$ mkdir chr14
	$ ./3dnome -s ./stg.ini -c chr14 -o ./chr14/

3. Convert the resulting file to HDF format (you need to have Java 8 installed) and copy it to the visualizer data directory:
	$ java -jar HcmToHdfConverter/HcmToHdfConverter.jar chr14 ./chr14/loops_chr14.hcm
	$ cp chr14/loops_chr14.h5 viewer/public_html/data/

4. Start the visualization server (you need the Flask python framework installed):
   $ pip install flask    # run only if Flask is not already installed on your system
   $ python viewer/public_html/src/viewapp.py

5. Use your browser to navigate to localhost:5000/index.html. 

Congratulations! You can now view the structure you have just generated by selecting it in the visualizer using the "Select model" option. 

The segment level structure is displayed by default. To see the subanchor level structure, select "2" from the "level" option. You may want to experiment with different options. To change the line width, expand the "Chromatin Options" submenu and move the "Line Thickness" slider with your mouse, or enter the value manually in the field on the right. To display the legend for the genomic coordinates and colors, use the "Display Options" -> "Overlay display" -> "Coord. Color" checkbox.

*** Ensembles

An ensemble of structures can be generated by providing the ensemble size with the -m option:

   $ mkdir TAD_ensemble
   $ ./3dnome -s ./stg.ini -c chr14:54832979-57034235 -o ./TAD_ensemble/ -m 10

This will generate 10 files representing structures of the TAD corresponding to the specified region. To keep the files in order, serial numbers are added as a suffix.

We can force the simulation algorithm to stop at a specified level, which is useful, for example, when we want to study a large number of structures at the segment level (and are not interested in the subanchor level at all). This is done by supplying the -v option, whose parameter is the maximal depth of the reconstruction (0 corresponds to the chromosome level, 1 to the segment level, 2 to the anchor level, 3 to the subanchor level). For example, to create 50 structures of chr14 at the segment level we can run the following commands:

   $ mkdir chr14_ensemble
   $ ./3dnome -s ./stg.ini -c chr14 -o ./chr14_ensemble/ -m 50 -v 1
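When many ensemble runs need to be scripted, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the options described above (-s, -c, -o, -m, -v); the helper name build_3dnome_command and the LEVELS table are our own illustrative choices and are not part of the package.

```python
# Depth values accepted by the -v option, as listed in the text above.
LEVELS = {0: "chromosome", 1: "segment", 2: "anchor", 3: "subanchor"}


def build_3dnome_command(settings, chromosomes, out_dir,
                         ensemble_size=None, max_level=None):
    """Return the argument list for one 3dnome run (hypothetical helper)."""
    cmd = ["./3dnome", "-s", settings, "-c", chromosomes, "-o", out_dir]
    if ensemble_size is not None:
        cmd += ["-m", str(ensemble_size)]  # number of structures to generate
    if max_level is not None:
        if max_level not in LEVELS:
            raise ValueError("max_level must be one of %s" % sorted(LEVELS))
        cmd += ["-v", str(max_level)]      # maximal depth of the reconstruction
    return cmd


command = build_3dnome_command("./stg.ini", "chr14", "./chr14_ensemble/",
                               ensemble_size=50, max_level=1)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the output directory exists.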

To facilitate the analysis of the ensemble structures, we provide a simple tool that calculates the matrix of their pairwise similarity. This can be done with the following command:

   $ ./3dnome -a ensemble -i ./chr14_ensemble/ -p loops_chr14_{N}.hcm

Here, the "-a ensemble" option tells the software that we want to perform the ensemble analysis, the -i option specifies the directory containing the structures, and -p provides the ensemble filename pattern. "{N}" represents the serial number that was automatically added in the previous step. The pattern is matched against the files in the specified directory, and all matching files are used in the analysis. The analysis is performed on all 3 main levels, and the results, in the form of similarity matrices, are saved in the same directory with filenames structural_distances_lvlM.heat, where M denotes the level (0 - chromosome, 1 - segment, 2 - subanchor). These matrices can be used directly for hierarchical clustering or any other analysis method.
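To check in advance which files a given pattern would pick up, the matching can be mimicked in a few lines of Python. This is only a sketch of the idea, not the program's own matching code: we assume that "{N}" stands for a plain decimal serial number, and the function names are illustrative.

```python
import re


def pattern_to_regex(pattern):
    """Turn e.g. 'loops_chr14_{N}.hcm' into a regex; {N} matches digits only."""
    prefix, placeholder, suffix = pattern.partition("{N}")
    if not placeholder:
        raise ValueError("pattern must contain the {N} placeholder")
    return re.compile(re.escape(prefix) + r"(\d+)" + re.escape(suffix))


def match_ensemble_files(filenames, pattern):
    """Return the matching filenames, ordered by their serial number."""
    regex = pattern_to_regex(pattern)
    matched = []
    for name in filenames:
        hit = regex.fullmatch(name)
        if hit:
            matched.append((int(hit.group(1)), name))
    return [name for _, name in sorted(matched)]


files = ["loops_chr14_10.hcm", "loops_chr14_2.hcm", "loops_chr14.hcm",
         "structural_distances_lvl1.heat", "loops_chr14_0.hcm"]
print(match_ensemble_files(files, "loops_chr14_{N}.hcm"))
```

In practice the list of filenames would come from os.listdir("./chr14_ensemble/").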


*** Extended pipelines
Start by using the GD and MDS methods to generate a graph distance matrix and a rough 3D structure at the segment level.

   $ python MDS/GDcast.py chr14/singletons_chr14_segment.heat 0 0 0 luk 0 kruskal

Several new files are generated. The most interesting for us are chr14/singletons_chr14_segment.heat__0.0.pdb and singletons_chr14_segment.heat__0.0.mds.protmap.txt, which contain the rough 3D structure (in PDB format) and the graph distance matrix, respectively.

As we noted in the paper, there are two main extensions to the basic pipeline presented above.

a) Using a distance heatmap for the segment level. Normally, the distance heatmap is generated based on the singleton interactions heatmap, but an arbitrary matrix can be used for this purpose. In our GD pipeline we use the graph distances matrix (*.mds.protmap.txt) obtained from the GDcast script by supplying the 3dnome program with a -d option:

   $ ./3dnome -s ./stg.ini -c chr14 -n chr14_gd -o ./chr14/ -d ./chr14/singletons_chr14_segment.heat__0.0.mds.protmap.txt

Note that we use the -n option to set a label that will be used in the names of the output files. If we did not provide this option, a default label (the chromosome id) would be used and the structure obtained previously (during the Quick Start) would be overwritten.

The heatmap can also be provided via the settings file using the 'dist_heatmap' option in the [template] node. As heatmaps may come from different sources, the magnitude of the distances may need to be adjusted to match the size of the domains obtained at the subanchor level. This can be done with the 'dist_heatmap_scale' option.

b) Using a structural template on the segment level. The template is simply a text file with a list of 3D coordinates for the segment beads. An example of a template file can be found at sample_data/chr14_template.txt. Alternatively, a template can be extracted from the hcm file. This way a number of structures can be generated with different parameters, but using the same segment-level backbone, which may facilitate the comparisons.

To use a template simply provide a path to the template as an additional argument in step 2 (-t). For example:

   $ cat chr14/singletons_chr14_segment.heat__0.0.pdb | awk '{print $7" "$8" "$9 }' > chr14/chr14_template.txt
   $ ./3dnome -s ./stg.ini -c chr14 -n chr14_mds -o ./chr14/ -t ./chr14/chr14_template.txt

The first command extracts the 3D coordinates from the PDB file; the second runs the simulation with the resulting template.
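The awk one-liner relies on whitespace-separated fields, which can fail when large coordinates fill their columns completely and run together. As an alternative, the coordinates can be read from the fixed columns defined by the PDB format (x in columns 31-38, y in 39-46, z in 47-54). The sketch below assumes that the file produced by GDcast follows this standard layout; the function name is illustrative.

```python
def pdb_to_template_lines(pdb_lines):
    """Return one 'x y z' string for every ATOM/HETATM record."""
    template = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip headers, CONECT, END and other records
        # Fixed columns of the PDB format: x = 31-38, y = 39-46, z = 47-54.
        coords = [line[30:38].strip(), line[38:46].strip(), line[46:54].strip()]
        for value in coords:
            float(value)  # raises ValueError if the record is malformed
        template.append(" ".join(coords))
    return template


def convert_pdb_to_template(pdb_path, template_path):
    """Write the coordinates found in pdb_path to a plain-text template."""
    with open(pdb_path) as pdb_file:
        lines = pdb_to_template_lines(pdb_file)
    with open(template_path, "w") as out_file:
        for line in lines:
            out_file.write(line + "\n")
```

For the example above one would call convert_pdb_to_template("chr14/singletons_chr14_segment.heat__0.0.pdb", "chr14/chr14_template.txt").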

Alternatively, an hcm or txt file can be used as a template. The sample_data/ directory contains two sample files that can be used: 

   $ ./3dnome -s ./stg.ini -c chr14 -n chr14_hcm -o ./chr14/ -t ./sample_data/loops_chr14.hcm -p 1
   $ ./3dnome -s ./stg.ini -c chr14 -n chr14_txt -o ./chr14/ -t ./sample_data/chr14_template.txt -p 2

As with the distance matrices, the template and its scaling factor can be set via the settings file ('template_segment' and 'template_scale' options).
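As an illustration, the snippet below shows what such a [template] node might look like and how it can be inspected from Python. It assumes that the settings file uses the common "key = value" INI syntax, which Python's configparser can read; the path and the scale value of 1.0 are placeholders, not recommended settings.

```python
import configparser

# Hypothetical fragment of a settings file; values are placeholders only.
SETTINGS_FRAGMENT = """
[template]
template_segment = ./chr14/chr14_template.txt
template_scale = 1.0
"""

parser = configparser.ConfigParser()
parser.read_string(SETTINGS_FRAGMENT)

template_path = parser["template"].get("template_segment")
template_scale = parser["template"].getfloat("template_scale")
print(template_path, template_scale)
```

The 'dist_heatmap' and 'dist_heatmap_scale' options belong to the same node and could be set in the same way.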



*** Nucleome modeling
In the examples above we worked with a single chromosome. One can run the simulation for the whole nucleome or for a selected set of chromosomes in a similar fashion by providing multiple chromosomes in the -c option.
   $ mkdir nucleus
   $ ./3dnome -s ./stg.ini -c genome -n all -o ./nucleus/
   $ mkdir nucleus_subset
   $ ./3dnome -s ./stg.ini -c chr4-chr12 -n subset -o ./nucleus_subset/

Multiple chromosomes can be provided as a comma-separated list ('chr1,chr2,chr3') or using range notation ('chr1-chr22', 'chr10-chr16'). The keyword 'genome' is shorthand for 'chr1-chr22,chrX'. For now, the web-based visualization tool does not support loading multiple chromosomes at once; this functionality will be added shortly.
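The notation can be illustrated with a short Python sketch that expands a specification into an explicit list of chromosomes. It is not the program's own parser: we assume the ordering chr1, ..., chr22, chrX, and region specifications such as 'chr14:54832979-57034235' are deliberately not handled.

```python
# Assumed chromosome ordering used to resolve ranges.
CHROMOSOME_ORDER = ["chr%d" % i for i in range(1, 23)] + ["chrX"]


def expand_chromosomes(spec):
    """Expand e.g. 'chr1,chr3-chr5' or 'genome' into a list of chromosome ids."""
    if spec == "genome":
        spec = "chr1-chr22,chrX"  # shorthand described in the text
    result = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            # list.index raises ValueError for unknown chromosome names
            start = CHROMOSOME_ORDER.index(first)
            end = CHROMOSOME_ORDER.index(last)
            if start > end:
                raise ValueError("empty range: %s" % part)
            result.extend(CHROMOSOME_ORDER[start:end + 1])
        else:
            if part not in CHROMOSOME_ORDER:
                raise ValueError("unknown chromosome: %s" % part)
            result.append(part)
    return result


print(expand_chromosomes("chr4-chr12"))
```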



*** Data files details
As singleton files tend to be large, it is possible to provide inter- and intrachromosomal singleton files separately (using the 'singletons_inter' and 'singletons' options, respectively). This allows the program to skip reading the interchromosomal files if they are not needed (e.g. when a single chromosome is reconstructed).

Generation of the (sub)anchor heatmaps (consult the 'use_anchor_heatmap' and 'use_subanchor_heatmap' settings) is disk-intensive, so it is beneficial to create the intrachromosomal singleton files for every chromosome separately and to use these per-chromosome files. To use this option one needs to set the 'split_singleton_files_by_chr' flag and to generate the per-chromosome files. They can be easily generated using the following commands:

   $ cd data
   $ mkdir chr
   $ for i in `seq 1 22` X; do ( cat clusters.txt | awk '{if ($1 == "chr'$i'") print $0}' > chr/clusters.txt.chr$i ) done

Here, clusters.txt is the original clusters file (this should be done for all the files from the 'clusters' setting). The resulting files should be created in the chr subdirectory of the data directory and should have the same name as the original file, with the chromosome id as a suffix. If the 'split_singleton_files_by_chr' flag is set, the program will automatically look for the per-chromosome files.
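The shell loop reads the whole input file once per chromosome. For large files, a single-pass split may be preferable. The sketch below produces the same file naming (original name plus the chromosome id as a suffix) and, like the awk command, selects lines by their first whitespace-separated field. The function name is illustrative.

```python
import os


def split_by_chromosome(source_path, out_dir, chromosomes):
    """Split source_path into out_dir/<name>.<chromosome> files in one pass."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(source_path)
    handles = {}
    try:
        # Create one output file per chromosome, even if it stays empty,
        # which mirrors the behaviour of the shell loop.
        for chrom in chromosomes:
            path = os.path.join(out_dir, "%s.%s" % (base, chrom))
            handles[chrom] = open(path, "w")
        with open(source_path) as source:
            for line in source:
                fields = line.split()
                if fields and fields[0] in handles:
                    handles[fields[0]].write(line)
    finally:
        for handle in handles.values():
            handle.close()


# Same chromosome set as `seq 1 22` X in the shell example.
ALL_CHROMOSOMES = ["chr%s" % i for i in list(range(1, 23)) + ["X"]]
```

For the example above one would call split_by_chromosome("data/clusters.txt", "data/chr", ALL_CHROMOSOMES).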



*** MMC Settings
There are a number of settings available for the MC simulation. We will now briefly describe the most important ones, which have an intuitive meaning; the full list of options follows.

- freq_dist_power - this exponent describes the relation between the singleton interaction frequency at the segment level and the physical distances between beads. Significantly different values will yield different shapes, and different values have been used in the literature. A value of -1.0 corresponds to a simple inverse relation.
- freq_dist_scale - scaling of the segment level structure 
- genomic_dist_scale - responsible for the size of the chromatin loops
- use_motif_orientation - whether or not to use the CTCF motif orientation
- use_subanchor_heatmap - whether or not to use the subanchor heatmap to refine chromatin loops modeling

All the available settings are described below.

[main]
output_level - set the level of output messages (range=0..10)
random_walk - if true then create a random walk structure on the segment level
loop_density - number of subanchor beads that will be placed between the consecutive anchor beads
use_2D - if true then the simulation is restricted to 2 dimensions
max_pet_length - maximal length of the PET clusters used on the subanchor level (in bp)
long_pet_power, long_pet_scale - describe how long PET clusters affect the segment heatmaps (scale*C^power, where C is PET count)

steps_lvl1 (lvl2, arcs, smooth) - number of simulations on the corresponding levels  (lvl1 - chromosome level, lvl2 - segments, arcs - anchor, smooth - subanchor)
noise_lvl1 (lvl2, arcs, smooth) - amount of noise used to create the initial structures on the corresponding levels

[data]
data_dir - path of the directory with data files
anchors - name of the anchor file
clusters - names of the cluster files (comma separated)
factors - names of the factors used in
singletons - names of files with intrachromosomal singletons
split_singleton_files_by_chr - flag denoting whether the files in 'singletons' were split by chromosome
singletons_inter - names of the files with interchromosomal singletons
segment_split - path to a BED file with the segment split info
centromeres - path to a BED file with the centromere locations

[distance]
genomic_dist_power, genomic_dist_scale, genomic_dist_base - describe relationship between genomic distance and the physical distance between subanchor beads (3D dist = base+scale*d^power, where d is genomic distance in kb)

freq_dist_scale, freq_dist_power - describe relationship between interaction frequency and physical distance, used to generate segment level expected distances matrix (3D dist = scale*F^power, where F is interaction frequency)

freq_dist_scale_inter, freq_dist_power_inter - the same as freq_dist_scale and freq_dist_power, but used for the chromosome level. This allows different relations to be used for the segment and chromosome levels.

count_dist_a, count_dist_scale, count_dist_shift, count_dist_base_level - describe relationship between PET count and the physical distance between subanchor beads (3D dist = base+scale/e^[a*(shift+C)], where C is PET count)
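The three relations of the [distance] node can be written down directly as functions, which is convenient for exploring how a change of a parameter affects the expected distances. This is a sketch that transcribes the formulas given above; the function and argument names are ours ('base' stands for genomic_dist_base or count_dist_base_level, respectively), and the numbers in the example are arbitrary, not recommended settings.

```python
import math


def genomic_to_distance(d_kb, base, scale, power):
    """3D dist = base + scale * d^power, with d the genomic distance in kb."""
    return base + scale * d_kb ** power


def frequency_to_distance(freq, scale, power):
    """3D dist = scale * F^power, with F the interaction frequency."""
    return scale * freq ** power


def count_to_distance(count, base, scale, a, shift):
    """3D dist = base + scale / e^(a * (shift + C)), with C the PET count."""
    return base + scale / math.exp(a * (shift + count))


# With power = -1.0 the distance is inversely proportional to the frequency.
for freq in (1.0, 2.0, 4.0):
    print(freq, frequency_to_distance(freq, scale=1.0, power=-1.0))
```

Note that for a positive count_dist_a the expected distance decreases as the PET count grows and approaches the base level for strong interactions.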

[template]
template_segment - path to the structural template file (if any)
template_scale - scale for the structural template
dist_heatmap - path to the distance heatmap file (if any)
dist_heatmap_scale - scale for the distance heatmap

[motif_orientation]
use_motif_orientation - whether to consider CTCF motif orientation during the simulation or not
weight - the weight assigned to the motif orientation energy term

[anchor_heatmap]
use_anchor_heatmap - whether or not to construct the anchor heatmap to refine the anchor beads placement
heatmap_influence - the influence of the pairwise anchor distances matrix

[subanchor_heatmap]
use_subanchor_heatmap - whether or not to construct the subanchor heatmap to refine the subanchor beads placement (i.e. chromatin loops shape and relative positions)
estimate_distances_steps - number of structures created to obtain the expected distance matrix
estimate_distances_replicates - number of simulation steps for every structure 
heatmap_influence - the influence of the pairwise anchor distances matrix
heatmap_dist_weight - the weight assigned to the expected distance matrix energy term

[heatmaps]
inter_scaling - scaling factor applied to the interchromosomal contacts (segment level). This can be used  
distance_heatmap_stretching - used to calculate the cap value for large 3D distances (cap = average * stretching) 

[springs]
stretch_constant_arcs - weight assigned to the flexibility energy term on the anchor level, when the distance is higher than expected
squeeze_constant_arcs - as above, but for distances smaller than expected

stretch_constant - as stretch_constant_arcs, but on the subanchor level
squeeze_constant - as squeeze_constant_arcs, but on the subanchor level
angular_constant - weight assigned to the bending energy term (subanchor level)


[simulation_heatmap]
max_temp_heatmap - initial temperature for the simulated annealing
delta_temp_heatmap - temperature reduction between iterations
jump_temp_scale_heatmap, jump_temp_coef_heatmap - parameters to scale the probability of accepting a move with higher energy
stop_condition_improvement_threshold_heatmap - improvement ratio that is required for the algorithm to stop (must be higher)
stop_condition_successes_threshold_heatmap - if the number of accepted moves during a milestone is higher than this value, then the algorithm continues
stop_condition_steps_heatmap - number of steps for each milestone

[simulation_arcs]
[simulation_arcs_smooth]
the same as for [simulation_heatmap]