Kim Lab »  MATLAB Tools

Computing Tools

This page contains a collection of in-house programs written in Python 3 (.py) or MATLAB (.m). These programs help us analyze the data we collect. Running them requires either a Python interpreter (plus the NumPy, SciPy, Matplotlib, and/or Biopython libraries) or MATLAB, and sometimes a Linux shell.

Python is a free, open-source programming language whose interpreter is available for Linux, Windows, Mac OS, and other platforms. It also has a rich array of third-party numerical and scientific libraries, making it an excellent choice for scientific computing.

MATLAB has a rich array of toolboxes that simplify many computational tasks. Although I started out writing most analytical tools in MATLAB, I am gradually translating all of the M-code to Python 3.

The scripts download as .txt files. Rename .txt to .m for MATLAB scripts and to .py for Python scripts.
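If you download several scripts, the renaming can be done in bulk. Below is a minimal sketch, not part of the lab's tools. The function name restore_extension and the two naming patterns it handles (tool.py.txt and tool.txt) are assumptions, since this page does not say exactly how the downloaded files are named.

```python
from pathlib import Path

def restore_extension(path, fallback_suffix=None):
    """Rename a downloaded .txt file so it ends in .py or .m.

    Hypothetical helper: handles 'tool.py.txt' -> 'tool.py' directly, and
    'tool.txt' -> 'tool.py' only when fallback_suffix is given.
    """
    src = Path(path)
    if src.suffix.lower() != ".txt":
        raise ValueError(f"{src.name} does not end in .txt")
    stem = src.with_suffix("")  # drop the trailing .txt
    if stem.suffix in (".py", ".m"):
        dst = stem  # the real extension was hiding under .txt
    elif fallback_suffix in (".py", ".m"):
        dst = stem.with_suffix(fallback_suffix)
    else:
        raise ValueError("pass fallback_suffix='.py' or '.m'")
    src.rename(dst)
    return dst
```

For example, restore_extension("pca_normal_vs_tumor.txt", ".m") would leave a file named pca_normal_vs_tumor.m in the same folder.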

If you think the programs may be useful for you, contact Li Tai Fang if you can't figure out how to use them. It may not be obvious without instruction how to use them. 

Microarray Analysis Tools:

File to Download:

Python 3 script to parse Affymetrix expression array data into formats that can easily be imported for data analysis.

Affy_data_parser.py
(last updated 4/2013 )

Python 3 script to create a correlation network from microarray expression data for Cytoscape visualization.

Affy_find_correlation_network_overall.py
(last updated 3/2013)
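To give a feel for what a correlation network is, here is a minimal sketch of the general idea. It is not the script above: it computes the Pearson correlation for every pair of genes, keeps the pairs whose absolute correlation passes a cutoff, and writes them as SIF lines (source, interaction type, target) that Cytoscape can import. The function names, the 0.9 cutoff, the "corr" interaction label, and the toy data are all assumptions for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile has no defined correlation; treat as none
    return sxy / math.sqrt(sxx * syy)

def correlation_edges(expression, threshold=0.9):
    """Return (gene_a, gene_b, r) for every gene pair with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(expression), 2):
        r = pearson(expression[a], expression[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges

def to_sif(edges, interaction="corr"):
    """Format edges as tab-separated SIF lines: source, interaction, target."""
    return "\n".join(f"{a}\t{interaction}\t{b}" for a, b, _ in edges)

# Toy expression values (one list per gene, one value per sample).
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [5.0, 1.0, 4.0, 2.0],
}
edges = correlation_edges(expression)
print(to_sif(edges))
```

With the toy data only GENE_A and GENE_B are strongly correlated, so the SIF output contains a single edge between them.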

Python 3 script to create a correlation network centered around a specific gene. Outputs .sif, .noa, and .eda files for Cytoscape visualization.

Affy_find_correlation_network_for_a_gene.py
(last updated 3/2013)

MATLAB script to generate figures based on Principal Component Analysis (PCA) for microarray expression data.

pca_normal_vs_tumor.m
(last updated 3/2012)

Python 3 script to compare a gene's expression levels in tumor vs. matched normal tissues in the Affy microarray expression data set.

Affy_1gene_tumor_vs_normal.py
(last updated 8/2012)

Python 3 script to make comparisons between any two genes in the Affy microarray expression data set. 

Affy_2genes_comparative_analysis.py
(last updated 8/2012)

Python 3 script to identify the genes with the greatest differential expression (i.e., tumor minus normal expression).

Affy_top_differential_expressions.py
(last updated 7/2012)
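The ranking idea can be sketched in a few lines. This is an illustration under stated assumptions, not the script above: it averages each gene's values across samples, subtracts normal from tumor, and ranks by the absolute difference so that both up- and down-regulated genes surface. Ranking by absolute value, the function name, and the toy data are my assumptions.

```python
from statistics import mean

def top_differential(tumor, normal, n=10):
    """Return the n genes with the largest |mean(tumor) - mean(normal)|.

    tumor and normal map gene symbol -> list of expression values.
    Each result is (gene, tumor_minus_normal), largest magnitude first.
    """
    diffs = {
        gene: mean(tumor[gene]) - mean(normal[gene])
        for gene in sorted(tumor.keys() & normal.keys())
    }
    ranked = sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]

# Toy data: two samples per gene.
tumor = {"GENE_A": [8.0, 9.0], "GENE_B": [5.0, 5.0], "GENE_C": [2.0, 3.0]}
normal = {"GENE_A": [4.0, 5.0], "GENE_B": [5.0, 6.0], "GENE_C": [7.0, 8.0]}
for gene, diff in top_differential(tumor, normal, n=2):
    print(gene, diff)
```

A positive difference means higher expression in tumor; a negative one means higher expression in normal tissue.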

NGS Analysis Tools:

File to Download:

Python 3 script: for SNP calls, extract sequencing data from the matched normal sequencing and attach this information to the tab-separated SNP file generated by LifeScope. This enables numeric filtering to find tumor-specific mutations.

extract_matched_normal_info_beta.py, for exome or other targeted analysis, where the consensus call file is small enough. (7/2012) 

extract_matched_normal_info_gamma.py, for whole genome analysis, where the consensus call file is too big to fit into memory. (7/2012) 
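When the matched-normal file is too big for memory, one common approach is to hold only the small SNP list in memory and read the large file one line at a time. The sketch below illustrates that approach and is not necessarily how the gamma script works. The column layout (chromosome, position, then data columns) is a simplified stand-in, not the actual LifeScope format, and the function name is an assumption.

```python
import io

def attach_normal_info(snp_lines, normal_lines, missing="NA"):
    """Append matched-normal columns to each SNP row.

    snp_lines: small tab-separated file (chrom, pos, ...), read fully.
    normal_lines: large tab-separated file (chrom, pos, data...), streamed.
    """
    snps = [line.rstrip("\n").split("\t") for line in snp_lines if line.strip()]
    wanted = {(row[0], row[1]) for row in snps}
    found = {}
    for line in normal_lines:  # one line at a time; never loads the whole file
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key = (fields[0], fields[1])
        if key in wanted:
            found[key] = fields[2:]
    return [row + found.get((row[0], row[1]), [missing]) for row in snps]

# Toy inputs; real use would pass open file handles instead of StringIO.
snp_file = io.StringIO("chr1\t100\tA\tG\nchr2\t200\tC\tT\n")
normal_file = io.StringIO("chr1\t99\t30\nchr1\t100\t42\nchr3\t5\t10\n")
for row in attach_normal_info(snp_file, normal_file):
    print("\t".join(row))
```

Memory use then scales with the number of SNP calls rather than with the size of the matched-normal file.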

A slightly modified version of the scripts above that writes only the targeted regions to the output file.

extract_matched_normal_info_inTarget.py, for use with a BED file that specifies the regions of interest. (3/2013) 
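The one thing to watch when filtering by a BED file is the coordinate convention: BED starts are 0-based and ends are exclusive, while variant positions are usually 1-based. Here is a minimal sketch of the in-target test under that convention. The function names are assumptions, and the linear scan over regions is chosen for clarity rather than speed.

```python
def load_bed(lines):
    """Read BED lines into {chrom: [(start, end), ...]} (0-based, end-exclusive)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blanks, comments, and header lines
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def in_target(regions, chrom, pos):
    """True if the 1-based position pos falls inside any region on chrom."""
    return any(start < pos <= end for start, end in regions.get(chrom, []))

# Toy BED: one region covering 1-based positions 101 through 200 on chr1.
regions = load_bed(["track name=targets\n", "chr1\t100\t200\n"])
print(in_target(regions, "chr1", 101))
print(in_target(regions, "chr1", 100))
```

For a large target list, sorting the regions and using a binary search (or an interval tree) would be faster than scanning every region.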

Python 3 script: after finding small somatic variants with Strelka and annotating the VCF with Annovar, summarize the findings into a list of genes. It also produces a "refined_excelstyle*.csv" file in each of the folders, putting nonsynonymous mutations in an easily readable format.

summarize_strelka_findings.py, requires Annovar and works best on a Linux operating system. The Annovar execution directory must be specified in the script. (5/2013)

Similar to the above, but summarizes the VCF output files from Broad Institute's MuTect. It also produces a "refined_excelstyle*.csv" file in each of the folders, putting nonsynonymous mutations in an easily readable format.

summarize_mutect_findings.py, also requires Annovar, whose execution directory must be specified in the script. (5/2013)

... and if you want to compare the somatic calls from GATK MuTect and Illumina's Strelka

tally_MuTect_vs_Strelka.py (4/2013)
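Comparing two callers boils down to set operations on the variants each one reports. The sketch below shows that idea and is not the script above: it keys each VCF record by chromosome, position, reference allele, and alternate allele, then counts shared and caller-specific calls. The function names are assumptions, and it applies no FILTER-column logic, which a real comparison would probably want.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) from the data lines of a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys

def tally(mutect_lines, strelka_lines):
    """Count calls shared by both callers and calls unique to each."""
    m, s = variant_keys(mutect_lines), variant_keys(strelka_lines)
    return {
        "both": len(m & s),
        "mutect_only": len(m - s),
        "strelka_only": len(s - m),
    }

# Toy VCF content; real use would pass open file handles.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
mutect = [header, "chr1\t100\t.\tA\tG\t.\tPASS\t.\n", "chr2\t50\t.\tC\tT\t.\tPASS\t.\n"]
strelka = [header, "chr1\t100\t.\tA\tG\t.\tPASS\t.\n", "chr3\t7\t.\tG\tA\t.\tPASS\t.\n"]
print(tally(mutect, strelka))
```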

... and if you want to dig out the microarray expression data for your NGS mutation calls and attach it next to the entries in those "refined_excelstyle*.csv" files. The results are printed to standard output on the command line. 

attach_Microarray-data_to_NGS-refined-excel.py (5/2013)

Gene parser:

Given a Gene Symbol, this program will give you:
1)  the cDNA Sequence, or
2)  the Peptide Sequence, or
3)  the Exon Sequence for every exon, or

The primer designer uses Primer3 as its core engine, which needs to be installed separately. 

Needed sources: human_hg19.fa and hg19_refGene.txt from UCSC

All of these Python 3 scripts need to be in the same directory; run one of the wrapper_*.py scripts rather than the modules themselves.

gene_readers.py (7/2013, the core module)


wrapper_function.py (6/2013, simple display)
wrapper_fasta_output.py (6/2013)

pcr_overlapping_primers.py (6/2013, Primer3 caller)
wrapper_primer_designer.py (6/2013)

Python 3 script to get a partial sequence from the human reference genome hg19. I wrote this script when the Ensembl website was down and I needed those sequences. 

get_partial_sequence_from_hg19.py
(12/2012)
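For anyone curious how a region can be pulled out of a large FASTA file without loading it all, here is a minimal sketch. It is not the script above: it walks the file line by line, tracks how many bases of the requested chromosome it has passed, and keeps only the overlapping slices. The function name and the 1-based inclusive coordinates are assumptions.

```python
import io

def fetch_region(fasta_lines, chrom, start, end):
    """Return bases start..end (1-based, inclusive) of chrom from FASTA lines."""
    pieces = []
    offset = 0  # bases of the target chromosome consumed so far
    in_target = False
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if in_target:
                break  # reached the next record; the target is finished
            in_target = line[1:].split()[:1] == [chrom]
            continue
        if not in_target or not line:
            continue
        line_start, line_end = offset + 1, offset + len(line)
        if line_end >= start and line_start <= end:
            lo = max(start, line_start) - line_start
            hi = min(end, line_end) - line_start + 1
            pieces.append(line[lo:hi])
        offset = line_end
        if offset >= end:
            break  # nothing further is needed
    return "".join(pieces)

# Toy FASTA with two short records; real use would pass an open file handle.
fasta = io.StringIO(">chr1\nACGTAC\nGTTTAA\n>chr2\nGGGGCC\nCCAATT\n")
print(fetch_region(fasta, "chr1", 5, 8))
```

Reading this way is slow for chromosomes near the end of the file; an indexed FASTA would allow direct lookups, at the cost of an extra dependency.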

After using Strelka and the "summarize_strelka_findings.py" script, we often want to validate the mutation calls by Sanger sequencing. This Python 3 script uses Primer3 to automatically design PCR primers for the mutation calls collected in the "excelstyle_*.csv" file generated by "summarize_strelka_findings.py".

find_primers_from_excelstyle.py (12/2012), requires a local installation of Primer3, plus the "summarize_strelka_findings.py" and "get_partial_sequence_from_hg19.py" scripts. 

A useful program to analyze the exon and coding region coverage for amplicon design. 

amplicon_analysis.py (1/30/2014), requires the "gene_readers.py" script, as well as resources used by gene_readers.py. 

RNASeq Tools:

 

After running Cufflinks and Cuffcompare, this script tells you where the novel splice sites are when compared to a known splice variant of the same gene.

rnaseq_gtf_reader.py (7/2013, a required module)

splice_site_finder_after_cuffcompare.py (7/2013)
