# ------------------------------------------------------------------ #
# THE AUTOMATED CLUSTERING PROGRAM README                            #
# K.K.DNAFORM                                                        #
# ------------------------------------------------------------------ #

1. INTRODUCTION
	Cap Analysis of Gene Expression (CAGE), which is based on 
sequencing the 5' ends of CAP-selected full-length cDNAs, allows
high-throughput gene expression analysis and shows both the 
location and strength of transcription start sites (TSSs).
	This program performs clustering and quality control 
suitable for CAGE sequence tags. It uses a modified version of 
Paraclu, the segment-scoring scheme developed by Frith et al. (2008)
for parametric clustering. To measure the reproducibility of each 
peak between replicates, it adopts the irreproducible discovery 
rate (IDR) (Li et al., 2011).

2. REQUIREMENTS
	This program requires the following software to be installed on 
your computer:
	- R (http://cran.r-project.org/)
	- SAMtools (http://samtools.sourceforge.net/) 
	- BEDtools (http://code.google.com/p/bedtools/)
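
	Before installing, you may want to confirm that these tools are 
	on your PATH. The following is a minimal sketch, not part of the 
	distributed pipeline. The function name missing_tools is 
	hypothetical, and the executable names R, samtools and bedtools 
	are assumptions about how the tools are installed on your system.

```shell
#!/bin/sh
# Print each given command that cannot be found on PATH, one per line.
# Prints nothing if every command is found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names below are assumptions; adjust them to your system.
missing=$(missing_tools R samtools bedtools)
if [ -n "$missing" ]; then
    echo "Missing required tools:" $missing >&2
else
    echo "All required tools were found."
fi
```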

3. INSTALLATION
	1) Download the idr_paraclu_pipeline.tar.bz2 file.
	2) Decompress the file.
		$ tar xvjf idr_paraclu_pipeline.tar.bz2
	3) Run install.sh
		$ cd idr_paraclu_pipeline
		$ ./install.sh
	4) When prompted during the installation, enter the directory 
	   path where you want to install this program.

4. USAGE
	automatedClustering inputdir outputdir tpm idr outputdir2 project
		inputdir	directory path containing the input files. At least 
					2 BAM files are needed. This program performs 
					clustering on all combinations of replicates.
		outputdir	directory path for the result files. This 
					program creates 2 BED files: a top peaks file and 
					a bottom peaks file. The clusters at the top of 
					each hierarchical cluster block are called "top 
					peaks". Likewise, the clusters at the bottom of 
					each cluster collection are called "bottom peaks".
		tpm			TPM (tags per million) threshold used when 
					clustering. We recommend 0.1; try other values 
					if you do not obtain an appropriate result.
		idr			IDR threshold used when discarding 
					irreproducible clusters. We recommend 0.1.
		outputdir2	directory path for the scatter plot output files.
					You can create a scatter plot of the hierarchical 
					stabilities of replicates 1 and 2 with the 
					following commands:
					$ R
					> source( "plotStability.R" )
					> PLOTSTAB( "cluster_[YOUR PROJECT]_stabilityplot.txt",
					  "stability.png" )
					> q()
		project		Your project name. You can use any name you like.
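
	As an illustration of the argument order above, the following 
	sketch wraps the call in a check that the input directory holds 
	at least 2 BAM files. It is not part of the distributed pipeline. 
	The function names count_bam_files and run_clustering are 
	hypothetical, the example paths and project name are made up, and 
	the sketch assumes that the input files end in ".bam" and that 
	automatedClustering is on your PATH.

```shell
#!/bin/sh
# Count the regular files ending in .bam directly inside a directory.
count_bam_files() {
    n=0
    for f in "$1"/*.bam; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Run the pipeline only if the input directory holds at least 2 BAM files.
# Arguments follow the order of the usage line:
#   inputdir outputdir tpm idr outputdir2 project
run_clustering() {
    if [ "$(count_bam_files "$1")" -lt 2 ]; then
        echo "error: $1 must contain at least 2 BAM files" >&2
        return 1
    fi
    automatedClustering "$@"
}

# Example call (hypothetical paths and project name, recommended thresholds):
#   run_clustering ./bam ./peaks 0.1 0.1 ./stability myproject
```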

5. OUTPUT
	This program mainly creates 2 BED files for the top and bottom peaks.
	Their format is as follows:
		Column1: The name of chromosome
		Column2: The starting position of the cluster
		Column3: The ending position of the cluster
		Column4: Hierarchical stability of the cluster
		Column5: IDR
		Column6: Strand
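
	Because the IDR is stored in Column5, the peaks can be filtered 
	further after a run, for example with a stricter cutoff than the 
	one given on the command line. The following is a minimal sketch 
	that assumes the columns are tab-separated, as is usual for BED 
	files. The function name filter_by_idr and the file name in the 
	example call are hypothetical.

```shell
#!/bin/sh
# Print the records whose IDR (column 5) is at or below a cutoff.
# Usage: filter_by_idr CUTOFF [FILE...]
# With no FILE, records are read from standard input.
filter_by_idr() {
    cutoff=$1
    shift
    # Adding 0 forces a numeric comparison in awk.
    awk -F '\t' -v cutoff="$cutoff" '($5 + 0) <= (cutoff + 0)' "$@"
}

# Example call (hypothetical file name):
#   filter_by_idr 0.05 top_peaks.bed > top_peaks.idr0.05.bed
```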

6. REFERENCES
	Frith MC, Valen E, Krogh A, Hayashizaki Y, Carninci P, Sandelin A. 
	2008. A code for transcription initiation in mammalian genomes.
	Genome Res. 18:1-12.

	Li Q, Brown JB, Huang H, Bickel PJ. 2011. Measuring reproducibility 
	of high-throughput experiments. Annals of Applied Statistics.
	5(3):1752-1779.

