KegArray ReadMe file

v1.2.4

Copyright (c) 2004 - 2009, 2015 Kanehisa Laboratories, All rights reserved.

KegArray is a Java application that provides an environment for analyzing
both transcriptome data (gene expression profiles) and metabolome data
(compound profiles). Tightly integrated with the KEGG database, KegArray
enables you to easily map those data to KEGG resources including PATHWAY,
BRITE and genome maps.

----------------------------------------------------------------------------
Table of contents

1. Requirements
2. Instructions
   2-1. Installation and starting KegArray
   2-2. Data format of input files
      2-2-1. Format for transcriptome data
      2-2-2. Data format for metabolome data
      2-2-3. Entry IDs from databases other than KEGG
      2-2-4. Using data prepared by Microsoft Excel
   2-3. KegArray control panel
   2-4. KegArray tool-bar menu
3. Limitations
   3-1. Out of memory error
4. About the License
5. Acknowledgment
6. Feedback
----------------------------------------------------------------------------

1. Requirements

 - Java 1.6 or later

2. Instructions

2-1. Installation and starting KegArray

 a.  Simply expand the KegArray archive file.

 b1. Start KegArray by double-clicking the KegArray bat / sh file
     (KegArray.bat / KegArray.sh).
 or
 b2. Start KegArray using the following commands.

     > cd [Installation Directory]
     > ./KegArray.sh

 - You need to set the browser program path the first time you start
   KegArray. (Open the preferences panel: Menu bar > Help > Preferences /
   Network tab.)

2-2. Data format of input files

2-2-1. Format for transcriptome data

KegArray can read the data format of the EXPRESSION database
(http://www.genome.jp/kegg/expression/) or tab-delimited text similar to
the EXPRESSION format. Each EXPRESSION entry consists of a brief
description of the experiment, reference information, and a set of
intensity values or two-channel ratios derived from a DNA microarray.

An example of intensity values is given below.
-------------------------------------------------------------------------
#organism: syn
#ORF      x   y   Control-sig   Control-bkg   Target-sig   Target-bkg
slr1485   1   2   1037.13       502.62        1593.30      695.25
slr1119   1   3   1261.63       494.72        2685.37      742.87
sll0708   1   4   922.97        561.38        1598.37      727.28
sll1120   1   6   2152.80       560.96        2591.23      771.07
sll1734   1   7   1918.47       574.57        5968.97      823.66
   :
   :
-------------------------------------------------------------------------

An example of expression ratios between the two channels is given below.

-------------------------------------------------------------------------
#organism: syn
#ORF      x   y   ratio
slr1485   1   2   0.610282
slr1119   1   3   2.360655
sll0708   1   4   0.842321
sll1120   1   5   0.769038
   :
   :
-------------------------------------------------------------------------

All lines beginning with the "#" character (other than the '#organism:'
or '#source:' line) are regarded as comments and skipped by KegArray.

The organism information is necessary to identify the ORFs, because an
entry identifier in the KEGG GENES database is a combination of the
organism code and the gene identifier joined by a colon (:), as in
'org:entry_id'. The organism should be given as the three-letter (or
four-letter) organism code used in KEGG (e.g. 'hsa' for human and 'mmu'
for mouse; the full list is available from
http://www.genome.jp/kegg/catalog/org_list.html).

The tab-delimited lines below the #ORF line contain the gene expression
profile data. The definition of each column is as follows.

- Columns for intensity values

  First column    KEGG GENES ID, which is the unique identifier of the
                  ORF in the organism.
  Second column   X-axis coordinate of the ORF on the microarray.
  Third column    Y-axis coordinate of the ORF on the microarray.
                  The second and third columns are used for specifying
                  the location of the ORF in the schematic of the DNA
                  microarray (ArrayImage in the KegArray analysis view).
  Fourth column   Signal intensity of the control channel.
  Fifth column    Background intensity of the control channel.
  Sixth column    Signal intensity of the target channel.
  Seventh column  Background intensity of the target channel.

  When the background-subtracted intensity (4th column - 5th column, or
  6th column - 7th column) is negative, KegArray treats it as 1.

- Columns for ratio values

  First column    KEGG GENES ID, which is the unique identifier of the
                  ORF in the organism.
  Second column   X-axis coordinate of the ORF on the microarray.
  Third column    Y-axis coordinate of the ORF on the microarray.
                  The second and third columns are used for specifying
                  the location of the ORF in the schematic of the DNA
                  microarray (ArrayImage in the KegArray analysis view).
  Fourth column   Ratio value between the control channel and the target
                  channel.

When there is no coordinate information (i.e. the second and third
columns are blank), KegArray assigns the coordinates automatically.

2-2-2. Data format for metabolome data

Only ratio values can be used for metabolome data, as in the following
example.

-------------------------------------------------------------------------
#COMPOUND   ratio
C00668      1.2
C00221      0.5
C01172      2.2
C00118      1.0
   :
   :
-------------------------------------------------------------------------

  First column    KEGG COMPOUND ID (e.g. C00668 for alpha-D-Glucose
                  6-phosphate).
  Second column   Relative amount of the target compound compared with
                  the control.

2-2-3. Entry IDs from databases other than KEGG

KegArray can convert external database IDs to KEGG GENES IDs, which are
necessary for mapping the array data to KEGG resources such as pathway
maps. The following example shows the case where NCBI-GIs are used in the
first column.
-------------------------------------------------------------------------
#organism: bsu
2633829   1   1   938    189   725    249
2633830   2   1   2692   189   2253   249
2633899   3   1   958    189   444    249
2636068   4   1   6703   189   2533   249
   :
   :
-------------------------------------------------------------------------

Using the ID converter (see "ID conversion" in section 2-3), you can
convert the NCBI-GIs to KEGG GENES IDs. Currently, the following external
databases are supported:

  External database    Database prefix
  -----------------    ---------------
  NCBI GI              ncbi-gi
  NCBI Entrez Gene     ncbi-gene
  GenBank              gb
  UniGene              unigene
  UniProt              up
  IPI                  ipi

2-2-4. Using data prepared by Microsoft Excel

To convert data in Microsoft Excel format for KegArray, first arrange the
columns in the order of the KegArray format, then save the sheet as
"tab-delimited text" using the "File->Save as" menu in Excel.

2-3. KegArray control panel

Once you launch KegArray, you will see the KegArray control panel. Each
function in the panel is described below.

*Data
  There are two tabs at the top of the KegArray control panel for
  selecting the Gene/Compound pane or the Clustering pane. In the
  Gene/Compound pane, you can load a data file of transcriptome and/or
  metabolome experiments and set parameters. In the Clustering pane, you
  can load several data files of transcriptome experiments and set an
  intensity threshold.

  *Gene/Compound pane

    *File:
      There are five buttons and a checkbox to display the pop-up window
      for specifying the input data file.

      [Local] button
        Open a pop-up window to select a data file on your local disk.
        The data file should comply with the format described in
        section 2-2-1.

      [GenomeNet] button
        Open a pop-up window to retrieve the data stored in the GenomeNet
        EXPRESSION database. Available entry IDs are listed in the
        window, and once you select one, its description will be
        displayed.

      [Compound data]
        This box should be checked (default) for loading metabolome
        data. You can ignore the metabolome data after loading by
        unchecking it.
      [Local] button
        Open a pop-up window to select a compound data file on your
        local disk. The data file should comply with the format
        described in section 2-2-2.

    *Threshold and normalization

      *Linear pane
        There are three input boxes to specify the parameters (Ratio
        threshold for transcriptome data and metabolome data and
        Intensity threshold for transcriptome data) for the confidence
        lines discriminating the regulated genes/compounds from the
        unregulated ones.

      *CC pane
        There are three input boxes to specify the parameters (Window
        length, Window size and Significance level) for the confidence
        curves discriminating the regulated genes from the unregulated
        ones.

      *Cancel/Apply buttons
        Once the values in the threshold and normalization input boxes
        are modified, "Linear" and "CC" in the pane names are marked
        with '*', which means the modified values have not been applied
        yet. You have to click the "Apply" button to use the new values
        in the following analyses.

  *KegArray analysis view
    After a transcriptome data file is loaded, a window is automatically
    launched to display information on the data in four panes.

    *Statistics
      In the Statistics pane, two distributions (gene expression
      intensities and ratios) are shown, which can be used for
      specifying the threshold.

    *ArrayImage
      In the ArrayImage pane, a schematic view of the DNA microarray is
      shown. The colors of the spots represent the level of increase or
      decrease of the target gene expression against the control. The
      coloring scheme can be changed in the preference menu (KegArray >
      Preferences > Color). Each spot is clickable and linked to the
      corresponding KEGG GENES database entry.

    *Scatter plot (for Linear pane)
      The scatter plot of the data is shown in this pane. The colors of
      the spots represent the level of increase or decrease of the
      target gene expression against the control. The coloring scheme
      can be changed in the preference menu (KegArray > Preferences >
      Color). A zoomed-in view is launched by dragging over an area of
      interest.
      In this view, each spot is clickable and linked to the
      corresponding KEGG GENES database entry.

    *MA plot (for CC pane)
      The MA plot of the data is shown in this pane. The colors of the
      spots represent the level of increase or decrease of the target
      gene expression against the control. The coloring scheme can be
      changed in the preference menu (KegArray > Preferences > Color).
      A zoomed-in view is launched by dragging over an area of interest.
      In this view, each spot is clickable and linked to the
      corresponding KEGG GENES database entry.

  *Clustering tab

    *Files:
      You can load the data files from your local disk or the EXPRESSION
      database.

      [Local] button
        Open a pop-up window to select data files on your local disk.
        Each data file should comply with the format described in
        section 2-2-1.

      [GenomeNet] button
        Open a pop-up window to retrieve the data stored in the GenomeNet
        EXPRESSION database. Available entry IDs are listed with
        checkboxes in the window. The description of an entry will be
        displayed when you select it.

      [Clustering] button
        Once you select more than one data file, this button becomes
        active. Clicking this button performs hierarchical clustering of
        the gene expression profiles constructed from the listed files.
        A tree view window is shown when the clustering is completed.
        You can change the number of clusters (1 - 6) by specifying the
        number in the input box at the top of the tree view window.
        Different clusters are shown in different colors. Clicking the
        [Set results] button saves the color coding for further analysis
        using the Tools section.

    *Organism
      The organism name of the input data (specified by the "#organism:"
      header).

    *Number of files
      The number of data files used for clustering.

    *Intensity threshold
      This is used to specify the threshold for the confidence lines
      discriminating the regulated genes from the unregulated ones. Only
      the genes with an intensity value above this threshold will be
      used for clustering.
    *Clustering algorithm
      Select a clustering algorithm from the pull-down menu (currently
      only complete linkage is available).

*Tools

  *Mapping to Pathway
    Map the gene expression and/or compound profiles onto the KEGG
    PATHWAY database. Clicking the [Go] button shows a list of pathways
    with the genes and/or compounds specified in the PathwayMap window.

  *Mapping to BRITE
    Map the gene expression and/or compound profiles onto the KEGG BRITE
    database, which is a collection of hierarchical classifications
    representing our knowledge on various aspects of biological systems.
    Clicking the [Go] button shows a list of BRITE hierarchy data with
    the genes specified in the BRITE window.

  *Mapping to Genome map
    Map the gene expression profiles onto the genome map provided by
    GenomeNet. The coloring scheme representing gene expression profiles
    can be changed in the preference menu (KegArray > Preferences >
    Color).

  *Mapping to KEGG DAS
    Map the gene expression profiles onto the genome map provided by
    KEGG DAS. The coloring scheme representing gene expression profiles
    can be changed in the preference menu (KegArray > Preferences >
    Color).

  Please note that only KEGG IDs can be used for mapping genes/compounds
  to the genome map, pathway map and BRITE data. If you use IDs other
  than KEGG IDs, ID conversion is necessary before using these tools
  (see below).

  *ID conversion
    ORF IDs from databases other than KEGG can be converted to KEGG
    GENES IDs by using the ID converter provided by GenomeNet (Internet
    access is necessary). NCBI-GI, GenBank, IPI, NCBI-Gene, UniProt and
    UniGene are available as the target databases. The conversion
    results can be seen from the menu Tools > Conversion table.

2-4. KegArray tool-bar menu

*KegArray

  [About KegArray]
    Show the version and copyright of KegArray.

  *Preferences

    [Color]
      In the Color pane, you can specify the coloring scheme for the
      gene expression and compound profiles and the number of color
      gradients.
    [Network]
      In the Network pane, you can specify the server URL of the link
      action target and the proxy server URL.

    [Conversion]
      In the Conversion pane, you can specify the URL of the server
      providing the database ID conversion tool (only GenomeNet is
      available now). The list of databases can be edited when you know
      that the database list on the GenomeNet server has been updated.

*File

  *Gene

    [Load data from EXPRESSION]
      Specify the entry ID of the GenomeNet EXPRESSION database.

    [Load data from local file]
      Specify the file name of the gene expression data on your local
      disk.

  *Compound

    [Load data from local file]
      Specify the file name of the compound and ratio data on your local
      disk.

*Edit

  [Find]
    Search for a gene by KEGG GENES ID. The gene will be marked on the
    ArrayImage, Scatter plot and MA plot viewers in the KegArray
    analysis view window.

*View

  [Statistics]
    Show the Statistics viewer.

  [ArrayImage]
    Show the ArrayImage viewer.

  [Scatter Plot (linear)]
    Show the Scatter plot viewer.

  [MA Plot (CC)]
    Show the MA plot viewer.

*Tools

  [PathwayMap]
    Map the gene expression and/or compound profiles onto the KEGG
    PATHWAY database. Clicking the [Go] button shows a list of pathways
    with the genes and/or compounds specified in the PathwayMap window.

  [BRITE]
    Map the gene expression and/or compound profiles onto the KEGG BRITE
    database, which is a collection of hierarchical classifications
    representing our knowledge on various aspects of biological systems.
    Clicking the [Go] button shows a list of BRITE hierarchy data with
    the genes specified in the BRITE window.

  [GenomeMap]
    Map the gene expression profiles onto the genome map provided by
    GenomeNet. The coloring scheme representing gene expression profiles
    can be changed in the preference menu (KegArray > Preferences >
    Color).

  [GenomeMap (KEGG DAS)]
    Map the gene expression profiles onto the genome map provided by
    KEGG DAS. The coloring scheme representing gene expression profiles
    can be changed in the preference menu (KegArray > Preferences >
    Color).
  [ID conversion]
    ORF IDs from databases other than KEGG can be converted to KEGG
    GENES IDs by using the ID converter provided by GenomeNet (Internet
    access is necessary). NCBI-GI, GenBank, IPI, NCBI-Gene, UniProt and
    UniGene are available as the target databases. The conversion
    results can be seen from the following menu item, Conversion Table.

  [Conversion Table]
    Show the conversion result table.

  Please note that only KEGG IDs can be used for mapping genes/compounds
  to the genome map, pathway map and BRITE data. If you use IDs other
  than KEGG IDs, ID conversion is necessary before using these tools.

*List

  *Gene
    The following menu items display a pop-up table listing the
    regulated genes. The number of listed genes can be modified by
    specifying a value in the box at the upper right of the pop-up
    table.

    [Up-regulated (Linear)]
      List the up-regulated genes whose intensity ratios are greater
      than the upper linear confidence line on the scatter plot.

    [Down-regulated (Linear)]
      List the down-regulated genes whose intensity ratios are less than
      the lower linear confidence line on the scatter plot.

    [Up-regulated (CC)]
      List the up-regulated genes whose intensity ratios are greater
      than the upper confidence curve on the MA plot.

    [Down-regulated (CC)]
      List the down-regulated genes whose intensity ratios are less than
      the lower confidence curve on the MA plot.

  *Compound
    The following menu items display a pop-up table listing the
    regulated compounds. The number of listed compounds can be modified
    by specifying a value in the box at the upper right of the pop-up
    table.

    [Up-regulated]
      List the up-regulated compounds whose ratios are greater than the
      threshold specified in the control panel.

    [Down-regulated]
      List the down-regulated compounds whose ratios are less than the
      threshold specified in the control panel.

3. Limitations

3-1. Out of memory error

Clustering a large number of expression profiles requires a large heap
memory size, and KegArray may terminate the process with the "out of
memory error" message. To avoid this, please set the intensity threshold
higher (to decrease the number of genes) or run KegArray from a terminal
with a command line like the following, where [size] is the maximum heap
size in megabytes.

  > java -jar -Xmx[size]M KegArray.jar

You have to set a larger memory size than the default (usually 64M).
(e.g. > java -jar -Xmx256M KegArray.jar)

4. About the License

The KegArray license corresponds to the license of KEGG. Refer to the
page below.

  http://www.genome.jp/kegg/legal.html

Title and intellectual property rights in and to any content displayed by
or accessed through this software belong to the respective content owner.
Such content may be protected by copyright or other intellectual property
laws and treaties, and may be subject to terms of use of the third party
providing such content.

5. Acknowledgment

Some charts in this product have been developed by using the JFreeChart
libraries (http://www.jfree.org/jfreechart/).

6. Feedback

We appreciate any suggestions and comments. Please use the GenomeNet
feedback form at the following URL to send your comments.

  http://www.genome.jp/feedback/?category=kegtools

==Escape Clause

THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL KANEHISA LABORATORIES OR ITS STAFF BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION).
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
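
Supplementary example for section 2-2-1

The intensity format described in section 2-2-1 can be illustrated with a
small Java sketch. It is not KegArray's own code: the class
KegArrayFormatSketch, the type Spot and the methods corrected and parse
are hypothetical names chosen for this example. The sketch assumes that
fields are separated by single tab characters, builds the 'org:entry_id'
identifier from the "#organism:" line, skips other "#" lines, and
replaces a negative background-subtracted intensity by 1 as stated in
that section. A difference of exactly zero is left unchanged, because the
text above only mentions negative values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KegArrayFormatSketch {

    /** One data row of the intensity format (hypothetical type, not part of KegArray). */
    static final class Spot {
        final String keggId;   // "org:entry_id"
        final double control;  // background-corrected control intensity
        final double target;   // background-corrected target intensity

        Spot(String keggId, double control, double target) {
            this.keggId = keggId;
            this.control = control;
            this.target = target;
        }
    }

    /** Signal minus background; a negative result is treated as 1 (section 2-2-1). */
    static double corrected(double signal, double background) {
        double diff = signal - background;
        return diff < 0 ? 1.0 : diff;
    }

    /** Parse the lines of an intensity-format file (assumption: tab-separated fields). */
    static List<Spot> parse(List<String> lines) {
        String organism = null;
        List<Spot> spots = new ArrayList<Spot>();
        for (String line : lines) {
            if (line.startsWith("#organism:")) {
                organism = line.substring("#organism:".length()).trim();
                continue;
            }
            if (line.startsWith("#")) {
                continue; // other "#" lines are skipped here
            }
            String[] f = line.split("\t");
            if (f.length < 7) {
                continue; // not an intensity row (e.g. blank or ratio-format line)
            }
            String id = f[0].trim();
            String keggId = (organism == null) ? id : organism + ":" + id;
            // Columns 2 and 3 (x, y) are ignored in this sketch.
            double control = corrected(Double.parseDouble(f[3]), Double.parseDouble(f[4]));
            double target = corrected(Double.parseDouble(f[5]), Double.parseDouble(f[6]));
            spots.add(new Spot(keggId, control, target));
        }
        return spots;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "#organism: syn",
            "#ORF\tx\ty\tControl-sig\tControl-bkg\tTarget-sig\tTarget-bkg",
            "slr1485\t1\t2\t1037.13\t502.62\t1593.30\t695.25",
            "slr1119\t1\t3\t1261.63\t494.72\t2685.37\t742.87");
        for (Spot s : parse(lines)) {
            System.out.println(String.format("%s\t%.2f\t%.2f", s.keggId, s.control, s.target));
        }
    }
}
```

Running the main method prints one line per ORF with the KEGG GENES ID and
the two background-corrected intensities. The sketch does not validate
malformed numbers and does not handle the ratio format or the automatic
coordinate assignment.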