Current Drug Discovery Technologies, 2005, Vol. 2, No. 2, 55-67

QSAR Modeling of Carcinogenic Risk Using Discriminant Analysis and Topological Molecular Descriptors

Joseph F. Contrera*, Philip MacLaughlin^a, Lowell H. Hall^b and Lemont B. Kier^c

Center for Drug Evaluation and Research, Office of Pharmaceutical Science, U. S. Food and Drug Administration, Rockville, MD 20857; ^aMDL Information Systems, 200 Wheeler Road, Burlington, MA 01803; ^bDepartment of Chemistry, Eastern Nazarene College, Quincy, MA 02170; ^cDepartment of Medicinal Chemistry, School of Pharmacy, Virginia Commonwealth University, Richmond, VA 23298, USA

* Address correspondence to this author at the Center for Drug Evaluation and Research, Office of Pharmaceutical Science, U. S. Food and Drug Administration, Rockville, MD 20857, USA; Tel: 301-827-5188; Fax: 301-827-3787; E-mail: [email protected]

1570-1638/05 $50.00+.00 2005 Bentham Science Publishers Ltd.

Abstract: A discriminant analysis model is presented for carcinogenic risk. The data set was obtained from the two-year rodent study FDA/CDER database and was divided into a training set of 1022 organic compounds and an external validation test set of 50 compounds. The model is designed for use as a decision support tool for a defined decision threshold, and is thus a binary discrimination into “high risk” and “low risk” categories. The carcinogenic risk classification is based on the method for estimating human risk from two-year rodent studies developed at the FDA/CDER/ICSAS. The paradigm chosen for this model allows a straightforward risk analysis based on historic information, as well as the computation of coverage, probability and confidence metrics that can further qualify the computed result. The molecular structures were represented as MDL mol files. The molecular structure information was obtained as topological structure descriptors, including atom-type and group-type E-State and hydrogen E-State indices, molecular connectivity chi indices, topological polarity, and counts of molecular features. The MDL®QSAR software computed all these descriptors. Furthermore, the discriminant analyses were all performed with the MDL®QSAR software. The reported model is based on fifty-three descriptors, using the nonparametric normal kernel method and the Mahalanobis distance to determine proximity. The model performed very well on the fifty compounds of the test set, yielding the following statistics: 76% correctly classified as “high risk” (carcinogenic) and 84% correctly classified as “low risk” (non-carcinogenic).

Keywords: Carcinogenicity, discriminant analysis, in silico, predictive toxicology, topological structure descriptors, QSAR, e-state, chemoinformatics.

I. INTRODUCTION AND BACKGROUND

Rodent carcinogenicity studies are required for the marketing of most chronically administered drugs. These studies are the most costly and time-consuming non-clinical regulatory testing requirement in the development of a drug. The cost is approximately $2 million for a rat and mouse study, requiring 2 years of treatment, and at least an additional 1-2 years for histopathological analysis and report writing. The human carcinogenic potential of a compound is a property that cannot be evaluated in clinical trials, and therefore safety decisions are made mainly on the basis of animal study results and risk/benefit considerations. The results of rodent carcinogenicity studies can have considerable impact on drug approvability. Even when rodent carcinogenicity findings do not prevent marketing, they can seriously restrict the marketing of some products or reduce their competitive advantage. Rodent carcinogenicity studies are usually initiated relatively late in drug development, when considerable resources have already been invested in a potential new product. Significant carcinogenic findings at this stage of drug development can have disastrous and costly consequences for both the sponsoring company and the regulatory agency in the form of additional review cycles and time and effort invested in failed applications. Predictive modeling can reduce the likelihood of developing a compound that produces significant rodent tumors and can, therefore, lead to significant savings for both the pharmaceutical industry and the regulatory agency. The rodent carcinogenicity bioassay is also a pivotal component of food safety and environmental [...]

Ready access to scientific knowledge is critical to support safety-related regulatory and product development decisions, particularly in situations where available experimental information is inadequate or unavailable, to identify information gaps, and to prioritize research. A current challenge is the development of better means to identify useful relationships and insights from large sets of data. Based on the major advances in computer technology, chemoinformatics, and predictive toxicology, the accumulated results of rodent carcinogenicity studies in public databases and FDA files can be more effectively used to improve the scientific basis of regulatory and product development decisions and reduce the use of animals in testing. It is conceivable that over time, with increased experience and confidence in carcinogenicity predictive software, it may be possible to reduce carcinogenicity testing for compounds that have molecular structures that are highly represented in the carcinogenicity database. This process
would reduce unnecessary testing and also free resources for testing compounds that are truly new molecular entities and are poorly represented in the carcinogenicity database.

In recent years methods have been developed for grouping together molecules with similar molecular structures, based on the use of topological structure information. Over the past decade there has been a significant growth in the use of similarity-based searching of databases in drug design and these methods have been shown to have a broader application [1-4]. The objective has been to organize a database of molecules according to a set of structure criteria so that compounds can be identified as being similar to a reference or target molecule. These similar compounds become candidates for screening or further analysis in the design process. The rationale is that compounds that are similar to a reference molecule are likely to be related to the behavior of the reference molecule in some sense. With the growth of combinatorial chemistry, the compounds in a database may be entirely or partially virtual; in other words, they are synthesized in silico. As a result, there may be no property value information with the molecules; hence, similarity is based entirely on the structural descriptions chosen in a particular study. There is thus no useful way of evaluating similarity based on physical properties except by virtue of the future success of the drug design project employing this general method.

Lajiness has shown quite clearly that a random search through a list of molecules is inferior to a search through an organized database, based on its ability to generate similarity or diversity in a study [3]. Some form of encoding structure information should be present for meaningful exploitation of a database. The code of structure information thus becomes the metric to evaluate similarity or its complement, diversity. This approach is not an exercise in multi-parameter QSAR modeling. With virtual molecules, many or all of the property values are unknown. The search is conducted by selecting a set of descriptors deemed important and finding the relation of molecules relative to a reference molecule using a metric such as distance or a grouping such as nearest neighbors. The objective is to create a cluster of molecules of potential interest based on several structure indices. Interesting compounds may appear that can be selected for screening or for further applications in the database search [...]

The encoding and subsequent searching can be a browsing process, using electrotopological state indices (E-State) values or other information-rich indices, such as molecular connectivity, removing the need for carefully delineated structural features which may be unknown or which can severely limit diversity. The choice of limiting distance values among molecules in the database makes it possible to reduce the number of output molecules. A qualitative advantage of this process is the stimulation of the chemist's imagination [5].

A large number of descriptors are available to be employed in the organization of a database. It is not our intention here to create a list of these or to make comparisons, each method being suitable for different circumstances. Our intention, however, is to build on the use of atom-type E-State descriptors along with molecular connectivity indices as a systematic organizer of a database, imparting a rich information source that can produce potentially useful structure patterns [6-8]. This present study requires that the structure representation employed be able to organize the data set molecules in such a way that those with high potential for a particular property (i.e., carcinogenic risk) be more closely associated with each other than with those that are associated with another property such as low carcinogenic risk. Previous work appears to indicate such a possibility [6-9]. The ability of the simple molecular connectivity indices to organize a set of skeletal structures has been demonstrated [9]. Molecular skeletons are grouped in a meaningful manner. Furthermore, directions within the representation space have meaning in terms of significant chemical information such as degree of skeletal branching, adjacency of branch points, and number of rings and types of fused ring systems. The atom-type E-State structure descriptors have also been shown to organize molecular structures in a chemically meaningful manner, emphasizing electronic information [10,11]. Based on the structure space provided by the atom type E-State descriptors, excellent similarity searches through a chemical database have been reported [7,8,10]. This combination of structure information representations provided the basis for the use of structure similarity methods together with topological descriptors that have recently been applied to QSAR modeling of rodent [...]

The result of these investigations indicates that the use of the atom-type E-State descriptors together with the molecular connectivity chi indices provides a structure space in which molecular structures are organized in chemically meaningful ways so that carcinogenic properties associated with those structures can also be expected to be usefully organized. As a result, statistical methods of analysis can be successfully applied to a data set based on E-State and molecular connectivity descriptors, as is demonstrated in this study.

II. EXPERIMENTAL DATA AND METHODS

The FDA/CDER Rodent Carcinogenicity Database

The FDA/CDER Rodent Carcinogenicity database was created from summary rat and mouse carcinogenicity study findings for over 1300 compounds that include both industrial chemicals and pharmaceuticals. Rodent carcinogenicity study results in the FDA database were obtained from the National Toxicology Program (NTP) rodent carcinogenicity database, the Lois Gold Carcinogen Potency (CPD) Database [13], FDA/CDER archives, and the scientific literature. The database includes the name and identification codes, the chemical structure represented as an MDL MOL file, and numeric carcinogenic activity units assigned (discussed below) to each compound.

Acceptance Criteria for Carcinogenicity Studies

Most of the carcinogenicity study results for pharmaceuticals in the FDA database were derived from pharmacology/toxicology and biostatistics reviews and reports in FDA files. The results of carcinogenicity studies in FDA new drug application (NDA) regulatory reviews for marketed products are available under the Freedom of
Information Act and are considered non-proprietary. The identities of pharmaceuticals currently under regulatory review as an investigational new drug application (IND) or new drug application (NDA), and of drugs that have never been marketed, are proprietary and cannot be disclosed without the consent of the sponsor. Proprietary compounds represent approximately 8% of the total number of carcinogenicity studies in the FDA/CDER database and are coded in this [...]

Carcinogenicity Study Design and Analysis

The design of rodent carcinogenicity studies for pharmaceuticals is essentially the same as the design employed for industrial and environmental chemicals and U.S. National Toxicology Program (NTP) rodent carcinogenicity studies. Male and female rats and mice are divided randomly into one or two control and three treatment groups of 50-70 animals per group per species. Historically, the highest dose in the studies analyzed generally approximates the maximum tolerated dose (MTD) in the test species, and is administered daily, usually in the feed or by oral gavage for 2 years. The rodent strains most often used in NTP studies are the inbred Fischer 344 rat and the hybrid B6C3F1 (C3H x C57BL/6) mouse. In pharmaceutical studies submitted to the FDA, the predominant rodent strains are the Sprague-Dawley derived CD rat and the CD-1 Swiss-Webster derived mouse. Despite the long experience in the FDA with these assays, the significance of tumors from life-time exposure at the maximum tolerated dose, the dose response extrapolation and the relevance of rodent tumors to humans continue to be highly controversial issues.

Classification and Stratification of Rodent Tumor Findings

In studies reviewed by FDA/CDER, tumor findings are classified as positive if benign and/or malignant findings are statistically significant in pair-wise comparison to concurrent controls by Fisher's Exact Test or equivalent statistical analysis. An adjustment for rare and common events is also applied to tumor findings [14]. Tumors are considered significant if they attained a level of p ≤ 0.01 for common tumors and p ≤ 0.05 for rare tumor types. Rare tumors are those with a spontaneous background incidence rate equal to or less than 1%. The incidences of benign and malignant tumors (adenomas and carcinomas) are combined and statistically evaluated where appropriate [15].

Data Transformation: The Numeric Representation of Carcinogenic Activity

Carcinogenicity studies are generally carried out in male and female rats and mice. Each sex/species is considered an individual study cell and therefore a complete battery of carcinogenicity studies for a compound comprises 4 study cells. A simplified numerical activity scale was used to quantify and stratify the results of rodent carcinogenicity studies. Compounds that produce statistically significant (by pair-wise comparison) tumors at multiple organ/tissue sites in a study cell were assigned the highest activity value of 50. Compounds that produce statistically significant single site tumors received an activity value of 40 and weaker or equivocal single site responses were assigned an activity value of 30. Studies with no statistically significant treatment-related tumor findings were assigned an activity value of 10. Compounds with 30 or more activity units in 2 or more study cells (2Plus), that is, having activity that crossed the biological barrier of gender or species, were classified as high risk carcinogens. Compounds with less than 30 activity units in 3 or more study cells were considered not to be high risk carcinogens. Compounds that were tested only in the rat or mouse may also be considered positive if there were significant tumor findings in both males and females. Compounds tested only in one species that have no tumor findings cannot be considered negative without additional information from at least one other study cell. Applying these rules, a training carcinogenicity database was created containing 1022 compounds with 4 cell or equivalent data, of which 649 compounds were classified as carcinogenic (High Risk), having tumor findings in at least 2 study cells, and 373 compounds were non-carcinogenic (Low Risk) with negative findings in 3 or more study cells. The greater number of positive compounds is partly a function of the scoring method employed. This scoring method is the same as that used to predict rodent carcinogenicity based on molecular similarity [12] and is a simplification of the multi-cell method used for MCASE-ES rodent carcinogenicity predictions [16].

The name and structure of proprietary compounds were coded and kept confidential by the FDA. Electrotopological descriptors derived from proprietary molecules were included in the training data set. Although electrotopological state and other topological descriptors employed contain sufficient information for successful modeling, they are insufficient to unambiguously recreate a proprietary [...]

A validation experiment was performed with a total of 50 test compounds that were not part of the MDL®QSAR (see below) control or training data set. The 50 test compounds were randomly removed from the 1072 compound rodent carcinogenicity training set. The carcinogenicity model was based on the remaining 1022 training set compounds. The 50 randomly selected test compounds included 38 pharmaceuticals, of which 9 (18% of the test set) were newer pharmaceuticals currently under regulatory review that are not yet marketed (structures and identity not disclosed), and 12 industrial chemicals. The 50 validation test compounds contained 25 “High Risk” compounds with tumor findings in two or more study cells (2Plus) and 25 “Low Risk” compounds with either no tumor findings or findings in only 1 study cell. Table 3 lists the compounds, their assigned risk level from the FDA/CDER Rodent Carcinogenicity database and the risk level as predicted from the model presented in this work.

III. COMPUTATIONAL METHODS

Descriptors and Descriptor Selection

The MDL®QSAR module implements molecular topological descriptors available within the Molconn-Z program [17a, 17b]. (A list of publications that illustrate the nature of topological descriptors and their applications is available [17c].) An initial set of 195 topological descriptors
was computed by the MDL®QSAR module for the entire training set of 1022 compounds that were tested for rodent carcinogenicity. The descriptors included atom-type, group-type, and individual atom E-State and hydrogen E-State indices, molecular connectivity chi indices, kappa shape indices, topological polarity, counts of molecular features (number of rings, number of H-bond donors and acceptors, etc.), and others. This initial set was reduced using the following criteria: first, descriptors were only considered that had non-zero value for at least 95% of all compounds and, second, the variance of the descriptor values had to be no less than a certain threshold, set equal to 1.

Model Development

The compounds in the FDA rodent carcinogenicity dataset are characterized either as carcinogenic or non-carcinogenic; therefore, this dataset presents a typical example of a binary classification problem. For the analysis of such datasets, MDL®QSAR employs methods of discriminant analysis. A complete description of this method can be found in a number of textbooks and monographs [18,19]. Herein, we provide a short description of the method pertinent to its implementation within MDL®QSAR.

MDL®QSAR incorporates the algorithms to develop discriminant models and the graphics interface that allows users to input data sets, initiate calculations, and analyze and manipulate resulting models. Each model is characterized by rich statistics available to the user. MDL®QSAR implements the entire range of discriminant analysis methods such as parametric, nonparametric kernel, and nearest-neighbor approaches. The classic parametric method of discriminant analysis is applicable in the case of approximately normal within-class distributions. The method generates either a linear discriminant function (the within-class covariance matrices are assumed to be equal) or a quadratic discriminant function (the within-class covariance matrices are assumed to be unequal). Our initial chemometric analysis of the FDA data set demonstrated that the distribution of the descriptor values did not follow the Gaussian law, which was indicated by normal distribution hypothesis testing at the 0.01 significance level. When the distribution is not assumed to follow a particular law or is assumed to be other than the multivariate normal distribution, nonparametric methods can be used to derive classification criteria. The nonparametric methods available within MDL®QSAR include the kernel and k-nearest-neighbor (kNN) methods. The main types of kernels implemented in MDL®QSAR include uniform, normal, Epanechnikov, biweight, or tri-weight kernels, which are used to estimate the group specific density at each [...]

In general, either Mahalanobis or Euclidean distances can be used to determine proximity between compound-vectors in multidimensional descriptor space. When the k-nearest-neighbor method is used, the Mahalanobis distances are based on the pooled covariance matrix. When a kernel method is used, the Mahalanobis distances are based on either the individual within-group covariance matrices or the pooled covariance matrix. Either the full covariance matrix or the diagonal matrix of variances can be used to calculate the Mahalanobis distances. In our studies, various combinations of model building preferences have been explored to achieve a model with the highest accuracy. The performance of each candidate model was assessed by making use of the prediction error rate in the training set (i.e., probabilities of misclassification). For each model, the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) rates were studied both for re-substitution analysis and for leave-one-out cross-validation. The computation of each model studied took several seconds on a Pentium 4 processor with 2.8 GHz and 1 GB memory.

IV. COMPUTATIONAL STUDIES

The task of finding the best model falls into two interconnected parts: the search for the best subset of descriptors and the selection of the type and optimal parameters of the model. Both these subtasks admit no formal algorithmic solution and require some experimentation to achieve the best solution. In our studies, more than 3000 discriminant analysis models were built in total, using different criteria for descriptor reduction and various parameters of discriminant analysis as described in the Methods Section. The best model included 53 variables (Table 1) and was characterized by the following parameters: normal kernel; smoothing parameter of 2; Mahalanobis distance to determine proximity; distance calculations based on the full individual within-group covariance matrices; [...]

The correct classification rates for the best discriminant model, which is supplied with MDL®QSAR, are shown in Table 2. Total accuracy of the model, both for re-substitution and LOO cross-validation analyses, is shown, as well as separate data for the test set prediction of carcinogenic (high risk carcinogen) and non-carcinogenic (low risk carcinogen) compounds.

V. DISCUSSION

Risk Mitigation

The modeling approach in this study was chosen specifically to support risk analysis. Unexpected results in long-term carcinogenicity bioassays on a new drug candidate can be extremely costly in time, money, and market viability. With today’s technology, applicants must carry the risk of long-term carcinogenicity well into phase 2-3 clinical development with little or no mitigation. At this stage of development, even a single failure can result in a huge loss: the late-stage non-approval of a drug can mean the loss of $700 million or more [20], and represent the loss of six to eight years of development effort.

The chosen modeling paradigm, in its selection of chemical structure variable type, its identification of a real world risk threshold in endpoint definition, and its straightforward statistical method, can be called ‘actuarial’ in approach. This allows the user of the model to view various confidence and applicability measures and restrict acceptable ranges as desired. Two such calculated metrics are the Distance and the Probability of Membership in Class. The
Table 1. Descriptors selected in the best model for the prediction of carcinogenicity. The descriptors appear in groupings that relate to their structure information content. A ranking is also given, based on descending order of F-value for inclusion of the descriptor in the model. A brief definition is given for each descriptor along with an illustration for a selected few descriptors. For more specific information on structure interpretation see the appropriate references [6-11,17d].

Low Order Chi. Encode degree of skeletal branching and molecular size. Low order indices x0 and x1 increase with molecular size and decrease with increased branching. Chi 2 (x2) shows the greatest sensitivity to differences in branching and increases with branching. Valence indices add information about heteroatoms and valence state. Simple illustration for x2 given below.
- Simple chi 0 index decreases minimally with increased branching, insensitive to adjacency.
- Simple chi 1 index encodes degree of branching and decreases with increased branching.
- Simple chi 2 index gives high sensitivity and increases with increasing branching.
- Valence chi 0 index is highly intercorrelated with molecular surface area and volume.
- Valence chi 1 is similar to x1 but also includes heteroatom and valence state information.
- Valence chi 2 index includes heteroatom and valence state information with high sensitivity.

Path Chi Indices. Encode complexities and specifics of overall skeletal variation, including degree of branching and molecular size. Each higher order path index encodes different aspects of skeletal variation. Valence indices add information about heteroatoms and valence state. A simple illustration for xp3 is given below.
- Simple chi path 3 is sensitive to adjacent branch points in the molecular skeleton.
- Simple chi path 4 is sensitive to branch points separated by one atom in the skeleton.
- Simple chi path 5-8 encode specific skeletal information to discriminate among skeletal classes.
- Valence chi path 3 is similar to xp3 with additional heteroatom and valence state information.
- Valence chi path 4 is similar to xp4 with additional heteroatom and valence state information.
- Valence chi path 5 is similar to xp5 with additional heteroatom and valence state information.

Cluster & Path/Cluster Chi Indices. Encode structure information specifically based on a branch point, emphasizing the immediate branch point environment. Simple illustration for xvpc4 given below.
- Simple chi cluster 3 index is defined for a single branch point and encodes the number and branching [...]
- Simple chi path-cluster 4 index is defined for the isobutane skeleton and is especially sensitive to adjacency of skeletal branch points.
- Valence path-cluster 4 index encodes information similar to xpc4 but with heteroatom and valence [...]
- Knotp gives the difference between chi cluster-3 and chi path/cluster-4 descriptor. Knotp is largest where an xc3 subgraph is not associated with an xpc4 subgraph. Each path cluster-4 (xpc4) subgraph contains a cluster-3 (xc3) subgraph and one additional atom. Each xc3 subgraph may be associated with up to three of these additional atoms and thus be contained within up to 3 xpc4 subgraphs, as shown in the table below. The knotp descriptor helps to separate this overlapping structure information into distinct numerical values.

Atom Type. Encode for a specified atom type a combination of electron accessibility, presence/absence, and the count of the atom type in the molecule based on the electrotopological state indices. The atom type E-State has been shown to be very useful for similarity analysis and structure classification and modeling of various properties [6,10,11].
- Sum of the atom level E-state values for all non-substituted [...]
- Sum of the atom level E-state values for all the substituted [...]
- Sum of the atom level E-state values for all the methylenes [...]
- Sum of the atom level E-state values for all the carbon atoms in methyl groups in the molecule.
- Sum of the hydrogen atom level E-state values for all the [...]
- Sum of the atom level E-state values for all the oxygen [...]
- Sum of the atom level E-state values for all the ether oxygen [...]
- Sum of atom level E-state values in molecule.
- Sum of the atom level E-state values for all the nitrogen [...]
- Sum of the atom level E-state values for all the chlorine [...]
- Sum of the atom level E-state values for all the double bonded oxygen atoms in the molecule.

Atom Type Count.
- Number (count) of all non-substituted aromatic carbon atoms in the molecule.
- Number (count) of methylene groups in the molecule.
- Number (count) of substituted aromatic carbon atoms in the molecule.
- Number (count) of =C< groups in the molecule.
- Number (count) of methyl groups in the molecule.
- Number (count) of double bonded oxygen atoms in the molecule.

Internal Hydrogen Bonding E-state. The largest single product of E-state and H E-state values from all acceptor and donor pairs separated by 4 skeletal bonds and not part of a rigid skeletal structure.
- Donor acceptor pair does not form an internal hydrogen bond.
- This group is associated with acids, amides, etc.
- Forms 5-membered ring for potential internal H bond.
- Forms 6-membered ring for potential internal H bond.

Molecular Properties. A group of structure descriptors that encode a general aspect of structure information for the whole [...]
- A whole molecule polarity index that decreases in value as the polarity increases and more sensitive to [...]
- Number of graph vertices (non-hydrogen atoms) in the molecule.
- Sum of the Hydrogen E-state values for hydrogens on carbon atoms.
- The maximum hydrogen atom level E-state value in a molecule.
- The maximum positive hydrogen atom level E-state value in a molecule.
- Number (count) of (independent) rings in the molecule.
- A kappa shape molecular flexibility descriptor that increases with homologation and decreases with increased branching or cyclicity. Larger Phia values indicate greater molecular flexibility.
- The maximum atom level E-state value in a molecule.
- Sum of the hydrogen atom level E-state values for all hydrogens bonded to donating atoms.
- Number (count) of hydrogen bond donors in the molecule.
- Number (count) of hydrogen bond acceptors in the molecule.
- Number (count) of chemical elements in the molecule.
- Number (count) of graph circuits in the molecule.
- A whole molecule E-state polarity index that decreases in value as the polarity increases.

Table 2. Carcinogenic Risk Prediction Accuracy. Columns: Training set, 1022 compounds; Test set, 50 compounds.
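The normal-kernel discriminant described in the text, whose accuracy Table 2 reports, can be sketched as follows. This is a minimal illustration and not the MDL®QSAR implementation. It assumes the common textbook form of the kernel density estimate, uses the diagonal matrix of within-group variances (one of the options the text mentions) instead of the full covariance matrix of the reported model, assumes equal prior probabilities for the two classes, and runs on made-up two-descriptor data. All function names, variable names and data values are hypothetical.

```python
import math
import statistics


def diagonal_variances(rows):
    """Sample variance of each descriptor within one class (diagonal covariance)."""
    return [statistics.variance(column) for column in zip(*rows)]


def kernel_density(x, rows, variances, r):
    """Normal-kernel density estimate for one class, evaluated at point x.

    Each training row contributes a normal kernel whose exponent is the
    squared Mahalanobis distance (diagonal case) scaled by the smoothing
    parameter r.
    """
    p = len(x)
    norm = (2.0 * math.pi) ** (p / 2) * r ** p * math.sqrt(math.prod(variances))
    total = 0.0
    for y in rows:
        d2 = sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances))
        total += math.exp(-0.5 * d2 / r ** 2)
    return total / (len(rows) * norm)


def posterior(x, training, r=2.0):
    """Posterior probability of membership in each class (equal priors assumed)."""
    densities = {
        label: kernel_density(x, rows, diagonal_variances(rows), r)
        for label, rows in training.items()
    }
    total = sum(densities.values())
    return {label: d / total for label, d in densities.items()}


# Hypothetical training data: two descriptors per compound.
training = {
    "HIGH": [(5.0, 5.2), (5.5, 4.8), (4.7, 5.6), (5.2, 5.1)],
    "LOW": [(1.0, 1.2), (1.4, 0.9), (0.8, 1.5), (1.1, 1.0)],
}

post_near_high = posterior((5.0, 5.0), training)
post_near_low = posterior((1.0, 1.0), training)
print(post_near_high)
print(post_near_low)
```

A compound is assigned to the class with the larger posterior probability. The smoothing parameter defaults to 2 here only because that is the value the text reports for the best model.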
calculated Distance shows whether the subject compound vector is adequately represented within the historic variance of chemical structure descriptors. The Probability of Membership in Class is a measure of how well the historic knowledge is able to discriminate high risk compounds from low risk compounds within the nearest space of the subject [...]

Probability of Membership in Class

The results for prediction are given in Table 3 along with the original rodent data. The prediction rates are 84% correct for low risk and 76% for high risk with an overall rate of 80% correct. Incorrect predictions are marked in bold. In this present study, we found that by placing limits on probability in class we could trade overall coverage for accuracy. By placing a minimum of 60% Probability-in-Class on our 50 compound test set, overall coverage was reduced to 76%, Sensitivity rose from 76% to 87%, Specificity remained essentially constant at 84% (83%), and Concordance improved from 80% to 84% (See Table 4). By placing a minimum of 65% Probability in Class, Coverage was 70%, Sensitivity 93%, Specificity 86% and Concordance 89% (See Table 5). Exercising this option allows a flexibility in how the model is employed, perhaps allowing a wider range of acceptable probability in screening large compound libraries to glean general characteristics, while restricting this range when assessing safety risks in lead compounds for [...]

Distance Measure

MDL®QSAR evaluates two quantitative measures of applicability of data models to new observations: 1. regression
62 Current Drug Discovery Technologies, 2005, Vol. 2, No. 2 Contrera et al. Table 3. List of Compounds used in the Validation Test Set along with the rodent data. Compounds showing only a number for “Name” are currently confidential at the FDA because they are under regulatory review. Predicted values are shown in bold for incorrect prediction. Predictions shown here are made without regard to probability level. See Tables 4 and 5 for tabulations based on selected probability ranges. (See text for details). See Experimental Data and Methods in text, pages 5-8, for detailed description of the data fields in Tables 3, 4 & 5. Female Rat Male Mouse Female Mouse Predicted ‘HIGH’ ‘HIGH’ ‘HIGH’ ‘HIGH’ ‘LOW’ ‘LOW’ ‘LOW’ ‘LOW’ ‘LOW’ ‘LOW’ QSAR Modeling of Carcinogenic Risk Current Drug Discovery Technologies, 2005, Vol. 2, No. 2 63 Table 4. List of test compounds together with the posterior probability for classification based on a 60% probabilit-in-class dividing line (See text). Compounds showing only a number for “molecule” are currently confidential at the FDA because they are under regulatory review. Compounds ‘not covered’ using this threshold are outlined in grey. Posterior Probability of Membership in Class Molecule Experimental Predicted 64 Current Drug Discovery Technologies, 2005, Vol. 2, No. 2 Contrera et al. (Table 4) contd.…. Molecule Experimental Predicted 60% Minimum Probability Table 5. List of test compounds together with the posterior probability for classification based on a 65% probability-in-class dividing line (See text). Compounds showing only a number for “molecule” are currently confidential at the FDA because they are under regulatory review. Compounds ‘not covered’ using this threshold are outlined in grey. Posterior Probability of Membership in Class Molecule Experimental Predicted QSAR Modeling of Carcinogenic Risk Current Drug Discovery Technologies, 2005, Vol. 2, No. 2 65 (Table 5) contd.…. 
Molecule Experimental Predicted 65% Minimum Probability
models and 2.discriminant analysis models. These are based
on the following simple assumption: each constructed model has a certain applicability region in the space of independent variables. Specifically, if our molecule is found to exist in an observation space “far” from the set used to build the model, the prediction for the object should be treated with caution, with a lower degree of confidence than in the case when the model was built using objects found “nearer” to our observation space.

Let $X$ be an $n \times p$ matrix of data with columns being n-dimensional vectors of variables $X_i$ and rows being p-dimensional vectors of observations $x_i$. With $m_X$ the vector of variable means, let

$$\tilde{X} = \begin{pmatrix} x_1 - m_X \\ x_2 - m_X \\ \vdots \\ x_n - m_X \end{pmatrix}$$

be a centered matrix of observations, and $A = \frac{1}{n-1}\tilde{X}^{T}\tilde{X}$ be a covariance matrix. A reasonable measure of proximity of a molecule to the training set in the observation space is the Mahalanobis distance, evaluated for a new observation $x$:

$$D_M = (x - m_X)\,A^{-1}\,(x - m_X)^{T}$$

and its special case (at $A = E$), the common Euclidean distance $D$:

$$D = (x - m_X)\,(x - m_X)^{T}$$

Having chosen the distance in the observation space, one should also decide which distances are to be considered “large” or “small”. In the case of regression models, it is natural to use an obvious analogy between outliers in the training set and “far-flung” observations. First note that the sum of Mahalanobis distances for all observations from the training set is $p(n - 1)$. Consider the quantity

$$d = \frac{D_M}{n - 1},$$

which is referred to as the centered leverage value and, for observations from the training set, lies between 0 and 1. Based on the rules for separation of outliers in the observation space, which are recommended in regression analysis, the degree of applicability of a regression model to an object that is not a member of the training set is evaluated as follows. If $d$ exceeds $p/n$, its average across the training set, more than twofold, one should treat the prediction for such a case with caution. If $d > 0.5$, the degree of applicability of a model to an object is taken to be very low; if $0.2 < d \le 0.5$, low; and if $d \le 0.2$ we consider it optimal to use the model.

For discriminant analysis models, such as the model contained in the MDL® Carcinogenicity Module, similar methods for separation of outliers are not used. In order to partition distances into “large” and “small”, an approach to data standardization is applied that is traditional for statistics. Suppose that distance D (Euclidean, see definition above) is normally distributed across the general population of data. Then, the standardized distance d1 (the difference between the distance and its sample mean over the training set)
divided by the sample estimate of mean-square deviation, will approximately follow a normal distribution with parameters 0, 1. Thus probabilities for d1 to fall into one or another interval of the real axis are found from the tabulated values of the Laplace function. For example, consider points x0 = 1.65 and x1 = 2.33 marked on the axis, such that the probability to fall to the left of them is, respectively, 0.95 and 0.99. If a new observation produces value d1 ≥ x1, the degree of applicability of a model to such an object may be taken to be very low; if x0 ≤ d1 < x1, low.

VI. CONCLUSIONS

The FDA/CDER rodent carcinogenic database provides a sound basis for development of a model for the prediction of carcinogenicity risk. The combination of the experimental data and the experience in the FDA provided the basis for qualifying 1072 compounds for the database. The MDL®QSAR software provided a useful basis for calculating the molecular descriptors and performing the discriminant analysis to establish a classification algorithm. The topological structure descriptors included electrotopological state (E-State) descriptors and molecular connectivity chi indices that have been shown to provide a sound basis for classifying molecular structures. A non-parametric method was used to obtain the final model based on the normal kernel method. Descriptors were selected by examining models with varying numbers of descriptors and deciding upon the model with the best classification statistics on the training set. The discriminant model presented here demonstrates good prediction statistics on the external validation test set of fifty compounds with sensitivity of 76% and specificity of 84% in addition to concordance of 80%. This test set includes 38 pharmaceuticals and 12 industrial chemicals. Nine of the pharmaceuticals are newer compounds that are still under regulatory review and confidential. Twenty-five of the test set compounds are considered high risk. Based on these results the model appears useful as an indicator of potential carcinogenicity risk for candidate molecules in a design process and for

Data transformation is an essential component in the QSAR modeling of carcinogenicity from rodent bioassay. This process converts tumor incidence findings into weighted numerical form with the highest score given to multi-site and trans-species tumor findings. This simulates aspects of the weight of evidence process used in regulatory risk analysis. Tumor site is not considered in this modeling process because there is poor tumor site concordance between rats and mice, making it a poor factor for QSAR modeling [21].

Converting molecular structure into electrotopological state (E-State) descriptors and molecular connectivity chi indices also provides a means for modeling proprietary molecular structures that does not disclose the exact structure and identity of proprietary molecules. In this report proprietary compounds were included in the training data set and in the 50 test compounds. The name and structure of proprietary compounds were encoded and kept confidential by the FDA. The proprietary structure information descriptors contained sufficient information for successful modeling but are insufficient to unambiguously recreate a molecular structure, making this a valuable tool for data sharing while preserving confidentiality.

Computational toxicology combines databases and chemoinformatic data mining techniques with statistical methods to identify relationships between chemical structures and toxicological activities. Computational or predictive toxicology software programs are a means of evaluating knowledge accumulated from decades of toxicology studies to provide effective regulatory and product development decision support information. This approach is especially useful for prioritizing potential hazard and identifying data gaps in situations where toxicological data is limited, e.g., indirect food additives or contaminants and degradants/contaminants in the pharmaceutical manufacturing process. In drug development, the application of combinatorial chemistry and high throughput screening has resulted in an unprecedented increase in the number of compounds identified with potentially desirable pharmacological properties. The selection of lead compounds for development is currently hampered by limitations in the available toxicity screening methods. Making better use of accumulated scientific knowledge incorporated in predictive software is one way to minimize toxicity related drug failures and improve pharmaceutical risk management. Identifying serious potential toxicity early in the drug development process before significant investments in time and resources are expended is a major goal for FDA/CDER and the pharmaceutical industry. A current cause for concern is that too many drugs are failing late in the development process in phase III either for lack of efficacy or toxicity. It is estimated that 20% of total R&D costs per drug are spent on compounds that ultimately fail due to unfavorable ADME/Toxicity. The selection of drug candidates with better safety profiles will also reduce the regulatory review burden and speed the approval process by reducing the number of drugs submitted with serious safety issues that necessitate multiple review cycles or result in termination. Review resources expended for drugs that never make it to market are lost and could better be used for

Computational or predictive toxicology has potential regulatory and drug development applications that can ultimately benefit the public health as well as refine and reduce the use of animals in the assessment of safety.

ACKNOWLEDGEMENT

We wish to express our appreciation to Vladimir Shwartz, University of St. Petersburg, St. Petersburg, Russia, for his assistance with the discriminant analysis and related

REFERENCES

Willett P.: Three-Dimensional Chemical Structure Handling; John Wiley & Sons: New York, (1991).
Lajiness M.S.: Molecular Similarity-Based Methods for Selecting Compounds for Screening, In Computational Chemical Graph Theory, Rouvray,
D.H. Ed.; Nova Science: New York, pp. 300-312,
Johnson M., Maggiora G.M.: Concepts and Applications of Molecular Similarity: John Wiley & Sons: New York, (1990).
Willett P.: Similarity and Clustering in Chemical Information Systems, John Wiley & Sons: New York, (1987).
Warr W.: Chemical Structures. The International Language of Chemistry, Springer: Berlin, (1988).
Hall L.H., Kier L.B.: Electrotopological state indices for atom types: A Novel combination of electronic, topological and valence state information, J. Chem. Inf. Comput. Sci. 35, 1039-1045, (1995).
Kier L.B.; Hall L.H.: Molecular Structure Description: The Electrotopological State: Academic
Hall L.H., Kier L.B.: Molecular Connectivity Indices for Database Analysis and Structure-Property Modeling, in Topological Indices and Related Descriptors in QSAR and QSPR, Devillers, J. and Balaban, A. T. Eds.; pp. 307-360, (1999).
Hall L.H.; Kier L.B.: Issues in the representation of molecular structure: The development of molecular connectivity. J. Molecular. Model. Graphics 20, 4-18,
Kier L.B., Hall L.H.: Database organization and similarity searching with E-State indices. SAR QSAR
Hall L.H.: A structure-information approach to prediction of biological activities and properties. Chem. Biodiversity 1, 183-201, (2004).
Contrera J.F., Matthews E.J., Benz R.D.: Predicting the carcinogenic potential of pharmaceuticals in rodents using molecular structural similarity and E-state indices. Regul. Toxicol. Pharm. 38, 243-259,
Gold L.S., Manley N., Slone T., Garfinkel G., Rohrback L., Ames B.N.: The fifth plot of the carcinogenic potency database: Results of animal bioassays published in the general literature through 1988, by the National Toxicology Program through 1989. Environ. Health Perspect. 100, 65-135,
Haseman J.K.: A re-examination of false-positive rates for carcinogenicity studies. Fundam. Appl.
McConnell E.E., Solleveld H.A., Swenberg J.A., Boorman G. A.: Guidelines for combining neoplasms for evaluation of rodent carcinogenesis studies. JNCI,
Matthews E.J., Contrera J.F.: A new highly specific method for predicting the carcinogenic potential of pharmaceuticals in rodents using enhanced MCASE QSAR-ES software. Regul. Toxicol. Pharmacol. 28,
a) MDL Information Systems, 200 Wheeler Road,
b) Kellogg, G. E.; Hall, L. H.; Molconn-Z. See
c) For a list of publications illustrating applications of the descriptors in Molconn-Z, see http://www.eslc.vabiotech.com/molconn/mconpubs.ht
d) See MDL®QSAR Users Guide for specific illustration of topological descriptors.
Anderson T.W.: An Introduction to Multivariate Statistical Analysis, Second Edition, John Wiley &
Kendall M.G., Stuart A., Ord J.K.: The Advanced Theory of Statistics, Macmillan Publishing: New York, Vol. 3, Fourth Edition, (1983).
Grabowski H., Vernon J., DiMasi J.: Returns on research and development for 1990s new drug introductions. Pharmacoeconomics 20, (Suppl. 3), 11-
Contrera J.F., Jacobs A.C., DeGeorge J.J.: Carcinogenicity testing and the evaluation of regulatory requirements for pharmaceuticals. Regul. Toxicol. Pharmacol. 25, 130-145, (1997).
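
The standardized-distance check described under Distance Measure for discriminant analysis models can be illustrated with a short sketch. It is an illustration under stated assumptions, not the MDL®QSAR implementation. It assumes that D is the squared Euclidean distance from the training-set mean, that the "sample estimate of mean-square deviation" is the sample standard deviation, and that the cutoffs are the values 1.65 and 2.33 quoted in the text. The function names, the label "not flagged", and the six training rows are hypothetical.

```python
from statistics import NormalDist, mean, stdev

X0 = 1.65  # cutoff quoted in the text for a left-tail probability of about 0.95
X1 = 2.33  # cutoff quoted in the text for a left-tail probability of about 0.99


def column_means(rows):
    """Mean of each descriptor (column) over the training observations."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def squared_euclidean(x, center):
    """Squared Euclidean distance between an observation and the training mean."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, center))


def standardized_distance(training_rows, new_row):
    """d1: (distance - training mean distance) / training standard deviation."""
    center = column_means(training_rows)
    train_d = [squared_euclidean(r, center) for r in training_rows]
    return (squared_euclidean(new_row, center) - mean(train_d)) / stdev(train_d)


def applicability(training_rows, new_row):
    """Grade model applicability for a new observation from its d1 value."""
    d1 = standardized_distance(training_rows, new_row)
    if d1 >= X1:
        return "very low"
    if d1 >= X0:
        return "low"
    return "not flagged"  # hypothetical label; the text names only the two flagged grades


def left_tail_probability(x):
    """Standard normal probability of falling to the left of x."""
    return NormalDist().cdf(x)


# Hypothetical two-descriptor training set, for illustration only.
training = [
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
    [1.0, 1.0], [0.5, 0.5], [2.0, 2.0],
]

if __name__ == "__main__":
    for row in ([0.75, 0.75], [2.0, 2.0], [10.0, 10.0]):
        print(row, round(standardized_distance(training, row), 3),
              applicability(training, row))
    print(left_tail_probability(X0), left_tail_probability(X1))
```

`left_tail_probability` is included only to confirm that the quoted cutoffs correspond to the stated probabilities under a standard normal distribution.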