About Human Gene and Protein Database (HGPD)

The entire human genome sequence has been determined by international project teams (1). In the post-genomic era, one of the most essential research subjects is the functional and structural analysis of gene products (proteins). Having full-length cDNA clones in hand is a key requirement for such functional genomics studies. Projects such as the Japanese FLJ project supported by NEDO (New Energy and Industrial Technology Development Organization) (2-4), the Kazusa long cDNA project supported by Chiba prefecture (a local government) (5, 6), the US Mammalian Gene Collection (MGC) program (7), and German (8), Chinese (9) and other cDNA projects have been carried out to isolate as many high-quality full-length cDNAs as possible (for a review, see reference 10).

Building an infrastructure for systematic and comprehensive expression of human proteins requires not only full-length cDNA clones but also a versatile system for making use of them. The Gateway cloning system (Invitrogen, CA, USA) is one such system, built around versatile expression vectors (11). We therefore adopted this system and constructed human Gateway entry clones from full-length cDNAs (12). To convert each cDNA to a Gateway entry clone, we first determined an open reading frame (ORF) region meeting the criteria (13,14). These ORF regions were then PCR-amplified using the selected resource cDNAs as templates. Full details of the construction and use of the entry clones will be published elsewhere (12). The amino acid and nucleotide sequences of the ORF for each cDNA, and the sequence differences between each Gateway entry clone and its source cDNA, are presented in the "GW: Gateway Summary" window (Fig. 18).

Using these clones with a highly efficient wheat germ cell-free protein synthesis system (15,16), we have produced a large number of human proteins in vitro. Expressed proteins were detected in almost all cases (12). Proteins in both the total and supernatant fractions are shown in the "PE: Protein Expression" window (Fig. 19). In addition, we have determined the subcellular localization of human proteins fused with a fluorescent protein in HeLa cells (17). The image data are shown in the "SL: Subcellular Localization" window (Fig. 20).

These biological data are presented on a framework of cDNA clusters in the Human Gene and Protein Database (HGPD, http://www.HGPD.jp) (18). To build this basic framework, sequences of FLJ full-length cDNAs and other sequences deposited in public databases (human ESTs, RefSeq, Ensembl, MGC, etc.) were assembled onto the genome sequence (NCBI Build 35 (UCSC hg17)).
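
As a rough illustration of the ORF determination step described above, the following Python sketch finds the longest ATG-to-stop ORF on the forward strand of a cDNA sequence. This is only an assumption for illustration: the actual criteria used for HGPD are those of references (13,14) and are not described in this text, and the function name `longest_orf` and the example sequence are hypothetical.

```python
# Illustrative sketch only: NOT the HGPD criteria (see refs 13,14).
# Assumes the standard genetic code and searches the forward strand only.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(cdna):
    """Return (start, end, orf) for the longest ATG-to-stop ORF, or None.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = cdna.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG of the ORF being read
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end, seq[start:end])
                start = None  # look for the next ORF in this frame
    return best


# Hypothetical toy sequence, not a real cDNA.
print(longest_orf("CCATGAAATTTTAGGG"))  # (2, 14, 'ATGAAATTTTAG')
```

A real pipeline would apply further criteria (for example a minimum ORF length or comparison with known proteins); the sketch only shows the basic frame scan.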

In the NEDO full-length human cDNA sequencing project (FLJ-PJ), in addition to about 30,000 human full-length cDNAs (FLJ cDNAs) (2), about 1,430,000 5'- and 3'-end sequences (ESTs) of full-length cDNAs were deposited in DDBJ/GenBank/EMBL (3). These were obtained from cDNA libraries constructed by the oligo-capping method from mRNAs of about 100 kinds of human tissues and cells. The majority of the cDNA inserts were longer than 2 kb, and the full-length rate at the 5'-end was more than 90% (19). By developing efficient search and evaluation systems for splicing variant (SV) cDNAs, more than 10,000 important splicing variant cDNAs have been obtained. The FLJ Human cDNA Database (ver. 3.0), which displays those data, has been constructed and will be released shortly (19). Only the number of SV cDNAs, 5,020 in total, is presented in the "others in d box" of the "C1: cDNA Summary 1" window in HGPD (Fig. 15). Most of those SV cDNAs have also been converted to Gateway entry clones.

*) The majority of the analysis data for cDNA sequences in HGPD are shared with the FLJ Human cDNA Database (http://flj.hinv.jp/), a human cDNA sequence analysis database focusing on the mRNA variety caused by variations in transcription start site (TSS) and splicing.

Category                                                       Numbers
Human cDNAs
    FLJ cDNAs
        Entirely sequenced cDNAs, FLJ-PJ                        30,063
        Entirely sequenced cDNAs, SV-PJ                          5,020
        ESTs, 5'-EST                                         1,323,199
        ESTs, 3'-EST                                           107,239
    Other Public Database
        Entirely sequenced cDNAs                                49,650
        RefSeq and Ensembl                                      63,342
        ESTs                                                 3,862,807
Gateway Entry Clone
    N-Type                                                      17,802
    F-Type                                                      25,447
Protein Analysis Data
    SDS-PAGE pictures of FLJ cDNAs                              17,821
    Subcellular localization pictures of FLJ cDNAs              10,917
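
As a quick consistency check, the EST counts in the table can be summed and compared with the figure of about 1,430,000 end sequences quoted in the text. The short Python sketch below does this arithmetic; the variable names are ours, and the "share" figure is simply derived from the table, not a statistic reported by HGPD.

```python
# Counts copied from the statistics table above.
FLJ_5P_ESTS = 1_323_199   # FLJ cDNAs, 5'-EST
FLJ_3P_ESTS = 107_239     # FLJ cDNAs, 3'-EST
PUBLIC_ESTS = 3_862_807   # ESTs from other public databases

flj_ests = FLJ_5P_ESTS + FLJ_3P_ESTS
all_ests = flj_ests + PUBLIC_ESTS
flj_share = flj_ests / all_ests  # derived value, not reported in the source

print(f"FLJ ESTs: {flj_ests:,}")   # FLJ ESTs: 1,430,438
print(f"All ESTs: {all_ests:,}")   # All ESTs: 5,293,245
print(f"FLJ share of ESTs: {flj_share:.1%}")
```

The FLJ total of 1,430,438 agrees with the "about 1,430,000" 5'- and 3'-end sequences stated for the FLJ project.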