Published: 9th June 2013
DOI: 10.4204/EPTCS.116
ISSN: 2075-2180


Proceedings Fourth International Workshop on
Computational Models for Cell Processes
Turku, Finland, 11th June 2013

Edited by: Ion Petre

Ion Petre
Invited Presentation: Computational Methods in Systems Biology: Case Studies and Biological Insights
Daniela Besozzi
Invited Presentation: Computational Methods for Metabolic Networks
Juho Rousu
Canonical Labelling of Site Graphs
Nicolas Oury, Michael Pedersen and Rasmus Petersen

Informal Presentations

Rule-Based Modeling of Polymerization Reactions
Thilo Krüger and Verena Wolf
Robustness Analysis of Stochastic Systems
Luboš Brim, Milan Češka, Sven Dražan and David Šafránek
Mathematical modelling of the platelet-derived growth factor signalling pathway
Andrzej Mizera, Jun Pang, Thomas Sauter and Panuwat Trairatphisan
Rule-based modelling as a platform for the analysis of synthetic self-assembled nano-systems
Abdulmelik Mohammed and Eugen Czeizler
An Algebraic Approach to Gene Assembly in Ciliates
Robert Brijder
Large-Scale Text Mining of Biomedical Literature
Filip Ginter


The fourth international workshop on Computational Models for Cell Processes (CompMod 2013) took place on June 11, 2013 at the Åbo Akademi University, Turku, Finland, in conjunction with iFM 2013. The first edition of the workshop (CompMod 2008) took place in Turku, Finland, in conjunction with Formal Methods 2008, the second edition (CompMod 2009) took place in Eindhoven, the Netherlands, in conjunction with Formal Methods 2009, and the third one (CompMod 2011) took place in Aachen, Germany, in conjunction with CONCUR 2013. This volume contains the final versions of all contributions accepted for presentation at the workshop.

The goal of the CompMod workshop series is to bring together researchers in Computer Science and Mathematics (both discrete and continuous), interested in the opportunities and the challenges of Systems Biology. The Program Committee of CompMod 2013 selected 3 papers for presentation at the workshop (one was withdrawn). In addition, we had two invited talks and five other informal presentations. We thank the PC members for their excellent work in making this selection. The CompMod 2013 Program Committee consisted of:

The scientific program of the workshop spans an interesting mix of approaches to systems and even synthetic biology, encompassing several different modeling approaches, ranging from quantitative to qualitative techniques, from continuous to discrete mathematics, and from deterministic to stochastic methods. We thank our invited speakers

for accepting our invitation and for presenting some of their recent results at CompMod 2013. The technical contributions address the mathematical modeling of the PDGF signalling pathway, the canonical labelling of site graphs, rule-based modeling of polymerization reactions, rule-based modeling as a platform for the analysis of synthetic self-assembled nano-systems, robustness analysis of stochastic systems, an algebraic approach to gene assembly in ciliates, and large-scale text mining of biomedical literature.
We would also like to thank the editorial board of the Electronic Proceedings in Theoretical Computer Science (EPTCS) for accepting to publish these proceedings in their series.

Ion Petre

Turku, Finland, May 2013

Workshop organizer and PC chair

Computational Methods in Systems Biology: Case Studies and Biological Insights

Daniela Besozzi (Università degli Studi di Milano, Dipartimento di Informatica, Milano, Italy)

Does Cell Biology Need Modeling?

In the last decades, mathematical modeling and computational methods have proven to be valuable approaches to achieve a better comprehension of the functioning of biological systems, which are complex (non linear) dynamical systems evolving in time and in space [6, 15, 27]. Computational investigations can aid in bridging the gap between what is already known of a biological system of interest, and what is hardly comprehensible through the interpretation of laboratory work outcomes (that sometimes can be inconsistent with previous results) or by merely grounding on data-driven hypotheses (that often require the further design of complicated and onerous verification experiments). In some cases, mathematical models can simply represent a thought-provoking standpoint for further extended analysis [18], but most of the times they allow to derive useful predictions on the system functioning.

The choice of the proper modeling and computational approach is constrained by the nature of the biological system under examination and by the experimental data in hand. Before the mathematical model of a biological system can be defined, one needs to solve some preliminary issues.

A thorough description of the above questions, and of the related applicable formal and computational methods, is clearly outside the scope of this extended abstract; general discussions and practical applications can be found in [1, 6, 9, 12, 15, 25, 26, 27, 29, 30].

In what follows, I briefly discuss the application of the mechanism-based modeling approach and the related computational methods, chosen according to the biological questions established for the study of two signal transduction pathways. In particular, I consider here state-dependent reaction-based models, where reactions describe molecular interactions based on the mass-action law; these models are fully parameterized from both the kinetic and the stoichiometric points of views, that is, (i) a kinetic constant is associated to each reaction, (ii) precise values of reactant and product molecules are specified for every species occurring in each reaction, and (iii) the initial state of the system is defined in terms of molecular amounts of all species.

This type of modeling approach presents some remarkable advantages with respect to other classical formalisms. First of all, reaction-based models are immediately understandable to experimental biologists, who are usually not familiar with mathematical formalisms, and therefore they can easily intervene in the discussion and in the development of the model itself. Secondly, they can be easily modified during further model refinements (e.g., after the model has been validated with laboratory experiments), in order to introduce new species or new reactions that were not included in the first formalization. Last but not least, the same model can be exploited to carry out both stochastic and deterministic simulations of the system dynamics, by simply transforming the reaction-based into a "generalized mass-action based" model (formalized by means of ordinary differential equations), and by converting stochastic constants and molecule copy numbers into kinetic rates and concentrations, respectively [13, 31].

Dynamics simulations then allow to globally explore the behavior of the system, in both quantitative and qualitative ways, so that by modulating the model components (species, reactions, and their related parameters) it is possible to better understand complex biological phenomena whenever simple intuition might fail to solve the established biological questions, or where contradictory experimental data prevent to find lifelike and unambiguous answers.

A Bird's Eye View on Two Case Studies

I discuss the investigation of two signal transduction pathways in yeast that have been a topic of my research in the last years, considered here as case studies to highlight the relevance of the crosstalk between computational scientists and biologists, and to show that in silico computational analysis can be reasonably considered as a complementary and effective integration tool to traditional wet approaches in Cell Biology.

The Ras/cAMP/PKA pathway

In S. cerevisiae, the Ras/cAMP/PKA signalling pathway regulates cell metabolism and cell cycle progression in response to nutritional sensing and stress conditions [28]. These processes are controlled through multiple feedback loops exerted by the protein kinase A (PKA) on a few pivotal components of the pathway, i.e., proteins Cdc25 and Ira (which regulate the GTPase Ras protein in a positive and negative manner, respectively) and two phosphodiesterases (which control the signal transduction termination through the degradation of cAMP). Both experimental and computational studies of this pathway are complicated by the complex interplay between these regulation mechanisms, making it hard to predict the behavior of the pathway in different growth conditions or in response to stress signals.

Previous experimental studies reported the observation of oscillations related to the Ras/cAMP/PKA pathway, though this was stated only in indirect ways, that is, by analyzing downstream processes as the nucleocytoplasmic shuttling of Msn2 [11, 17], a transcription factor that is under the control of PKA; other analysis then showed that the oscillations frequency of Msn2 between the nucleus and the cytoplasm can be influenced by PKA as well as by the two phosphodiesterases [5]. This periodicity can be ascribed to an oscillatory behavior of cAMP concentration and of PKA activity, though no direct measurements of the dynamics of these components were executed in vivo to verify their effective role.

In order to understand the role of these feedback controls, and to make predictions on the cellular conditions that are able to regulate the transition between stationary and oscillatory regimes, a computational analysis of the Ras/cAMP/PKA pathway was carried out, primarily focusing on the mechanisms that allow the emergence of oscillatory regimes in cAMP dynamics [4, 23]. The model of the Ras/cAMP/PKA pathway was initially developed according to the stochastic formulation of chemical kinetics [13], and then translated into a generalized mass-action based model [31], in order to compare the outcomes of stochastic and deterministic simulations. To this aim, specific computational tools had to be developed, in order to quantify the amplitude and frequency of cAMP oscillations in stochastic simulations. The choice of the stochastic approach was motivated by the low molecular amounts of some pivotal components of the pathway (i.e., Cdc25 and Ira proteins). The comparison between stochastic and deterministic analysis evidenced different quantitative and qualitative dynamics of cAMP under various conditions (see, e.g, Figure 6 and 13 in [4]), suggesting a functional role of the intrinsic noise occurring in this pathway.

In particular, the computational analysis indicated that stable oscillatory regimes of cAMP can be established when the negative feedback on Ira proteins is activated, and that this dynamics is regulated by the balance between the activities of Cdc25 and Ira proteins. In addition, the analysis highlighted that GTP and GDP pools, which concurrently regulate Ras proteins, are able to control the transition between steady states and oscillations of cAMP, and they also suggested a regulative role for the phosphodiesterases [4]. Overall, the model allowed to find out, in a detailed quantitative way, that the coupling between the feedback mechanisms and the molecular level of Ras modulators (Cdc25/Ira, GTP/GDP) is able to influence the oscillatory regimes of cAMP and PKA.

One of the biological insights achieved with the modeling of the Ras/cAMP/PKA pathway concerns the multi-level regulation carried out through different feedback mechanisms, which might represent a way to extend the regulatory span of the system in yeast cells, and might therefore act as a tuning mechanism for the control of the numerous downstream targets of PKA. For instance, the modeling suggested that a frequency modulation can be achieved through the perturbation of the ratio between the cellular amounts of Cdc25 and Ira proteins.

Despite the Ras/cAMP/PKA pathway has been extensively investigated in S. cerevisiae, time-course data on cAMP variations in yeast cells are still lacking, due to the difficulty of developing an accurate experimental protocol to carry out these measurements. Laboratory work is currently in progress to set up a FRET-sensor able to respond to cAMP levels in S. cerevisiae, which should allow to measure in vivo the changes in the level of cAMP in single cells [8]. This very promising protocol still needs to be optimized, in order to increase its stability and to reduce the noise intrinsic in the measurements, so that long term oscillations of cAMP in single yeast cells will finally become measurable. As a consequence, the novel experimental setup will allow to carry out a full validation of the model, and to conduct an in depth analysis of the response of the pathway to different nutritional and stress conditions.

The Post-Replication Repair pathway

Post-Replication Repair (PRR) is a DNA-damage tolerance pathway involved in the recognition and bypass of the genome lesions induced by sunlight exposure and UV radiation [19, 24]. PRR acts through two different sub-pathways, that are activated by a post-translational modification of PCNA, a protein acting as a scaffold for the binding of several proteins involved in DNA replication, repair and cell cycle regulation. In S. cerevisiae, the experimental investigation of this complicated and not well characterized pathway determined that PCNA mono-ubiquitylation drives the activation of an error-prone DNA lesion-bypass mechanism (Translesion DNA Synthesis, TLS), while PCNA poly-ubiquitylation activates an error-free mechanism (Template Switching, TS) [32]. A major issue in the study of PRR consists in understanding how the dynamics of PCNA ubiquitylation might influence the choice between TLS and TS, or whether there exists a damage-related threshold able to regulate the crosstalk between these sub- pathways.

In order to dissect these aspects, we defined a reaction-based stochastic model of PRR [2]. The choice of the stochastic framework was motivated by the low molecular amounts of most species involved in PRR, and by the quite large noise that we evidenced in the experimental data when measuring the ratio between mono- and poly-ubiquitylated PCNA. The definition of the mathematical model of PRR benefited from a preliminary bioinformatic analysis based on 3D protein structure modeling, which served to confirm the actual spatio-temporal cascade of interactions between the proteins involved in PRR, which were not completely known; in addition, a successive computational phase based on parameter sweep analysis and sensitivity analysis was used to test the reliability of the model parameterization.

The laboratory measurements of mono- vs. poly-ubiquitylated PCNA ratio were used as the wet read- out of the cellular response to acute UV irradiation, carried out at low and high doses in both wild-type and mutant yeast cells. To this aim, a specific experimental protocol had to be developed so that, differently from other previously devised methods, both mono- and poly-ubiquitylated PCNA isoforms could be detected on a single western blot; these measurements were then processed in order to compare the different kinds of experimental and computational results (that is, the ratios of modified PCNA derived in laboratory, and the molecular amounts of modified PCNA obtained from stochastic simulations).

The results of the continuous crosstalk between experimental and computational analysis suggested the existence of a UV dose threshold for the proper functioning of the PRR model, while at higher UV doses we came across an unexpected discrepancy between wet data and simulation outcomes. This inconsistency first led to several steps of model verification, and then to the formulation of novel biological hypotheses on the functioning of PRR. As a result, further laboratory experiments especially designed to validate the model predictions evidenced an overlapping effect of Nucleotide Excision Repair (the cellular pathway effectively responsible to clean the genome from UV lesions) on the dynamics of PCNA ubiquitylation in different phases of the cell cycle [2]. Altogether, these findings highlighted an intricate functional crosstalk between PRR and other processes controlling genome stability, establishing that PRR is less far characterized than previously thought, and opening new research perspectives on the study of DNA repair and tolerance systems.

Concluding Remarks

In the investigation of biological systems, a negative result - either obtained from unmatching simulation outcomes, or due to unexpected experimental data - can be as important as a positive result, since it highlights a deficiency in the comprehension of the underlying molecular mechanisms. As a matter of fact, sometimes the most interesting part of the modeling game is not what the model allows to understand, but exactly what it cannot explain. Obviously, the model's incapability necessarily requires a careful check to determine the source of the weaknesses, which can be related to some defects or incompleteness of the model, as well as to unforeseen bugs in the experiment setup or in the interpretation of the corresponding measurements.

An important problem, related to these aspects, is that modelers typically have insufficient knowledge of laboratory methodologies, and are usually not able to interpret the results of an experiment, to determine if literature data are fully reliable, or to settle the right questions that can drive the development of a good model. To overcome such limitations, only a continuous side-by-side work between modelers and biologists, as well as the iterative crosstalk between the computational and experimental outcomes they generate, can lead to a proper Systems Biology interdisciplinary workflow consisting in data-driven modeling and model-driven experiments, that can eventually turn into a better understanding of the functioning of the system under investigation.

Indeed, in order for a model to be useful to experimental biologists, it must be strictly linked to biological data and, possibly, rely to dedicated ad hoc measurements. To this aim, experimental biologists should be open to design new laboratory experiments and specific measurement methodologies that allow to identify the qualitative and, most important, the quantitative features of the system that are still lacking, while modelers should be prepared to develop novel mathematical and computational tools, that are able to deal with the established biological questions and the actual data in hand, instead of "forcing" the system to be described and analyzed with some favorite and handily applicable framework. The awareness and fulfillment of these requisites is surely likely to increase our ability in comprehending the complexity of biological systems.

Concerning the computational analysis of biological systems, an important aspect related to the computational costs is worth to be mentioned here. In order to gain novel insights into the functioning of a biological system, under physiological or perturbed conditions, the application of many computational methods requires the execution of a large number of simulations, especially when one wants to derive statistically meaningful properties or to exploit tools based on parameter sweep analysis, sensitivity analysis, parameter estimation and reverse engineering of model topologies (see [1, 7] and references therein). These analyses usually result into very high computational costs when the simulations are executed on standard central processing units (CPUs), therefore demanding a shift to the use of parallel architectures (e.g., computer clusters) or high-performance processors. To this aim, the higher memory bandwidth and computational capability of graphics processing units (GPUs) have been increasingly exploited in the last years in the field of Bioinformatics and Computational Biology [10, 14, 22]. Besides the relevant reduction of computational costs and the possibility to run multiple analysis in parallel, general-purpose GPU computing benefits from cheap, diffused and highly efficient multi-core devices, and it provides an effective mean to sustainable computing since it drastically reduces the power consumption. Despite these advantages, scientific applications of GPUs risk to remain a niche for few specialists, since GPU-based programming substantially differs from CPU-based computing and therefore requires specific programming skill. In order to provide user-friendly computational methods to analyze the dynamics of biological systems, we started to develop GPU-powered tools to carry out stochastic [21], deterministic [20] and hybrid simulations [3] for mass-action based models. These methods have been successfully tested to execute sensitivity analysis, parameter sweep analysis and parameter estimation of the Ras/cAMP/PKA and PRR models, showing that the GPU implementation allows to gain a speedup ranging from 25x to 50x with respect to the corresponding CPU implementation of the same simulation algorithms.

Finally, a sentence that perfectly resumes the concepts discussed above and epitomizes the hurdles and the beauty inherent to the computational analysis of biological systems, can serve here as my last remark: "Building a good model depends too much on intuition, on the rare abilities to ask the right question and to sense mathematical order behind messy facts, on tricky timing (not too early, when absence of data leaves a model unsubstantiated, not too late, when everything is clear without a model), and on hard, long thinking that makes modeling so painful, but also so much fun" [18].


[1] B.B. Aldridge, J.M. Burke, D.A. Lauffenburger & P.K. Sorger (2006): Physicochemical modelling of cell signalling pathways. Nature Cell Biology 8(11), pp. 1195-203, doi:10.1038/ncb1497.

[2] F. Amara, R. Colombo, P. Cazzaniga, D. Pescini, A. Csikasz-Nagy, M. Muzi Falconi, D. Besozzi & P. Plevani (2013): In vivo and in silico analysis of PCNA ubiquitylation in the activation of the Post Replication Repair pathway in S. cerevisiae. BMC Systems Biology 7(24), doi:10.1186/1752-0509-7-24.

[3] D. Besozzi, G. Caravagna, P. Cazzaniga, M.S. Nobile, D. Pescini & A. Re (2013): GPU-powered simulation methodologies for biological systems. Submitted to Wivace2013.

[4] D. Besozzi, P. Cazzaniga, D. Pescini, G. Mauri, S. Colombo & E. Martegani (2012): The role of feedback control mechanisms on the establishment of oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae. EURASIP Journal on Bioinformatics and Systems Biology 10, doi:10.1186/1687-4153-2012-10.

[5] K. Bodvard, D. Wrangborg, S. Tapani, K. Logg, P. Sliwa, A. Blomberg, M. Kvarnstrom & M. Kall (2011): Continuous light exposure causes cumulative stress that affects the localization oscillation dynamics of the transcription factor Msn2p. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research 1813(2), pp. 358-366, doi:10.1016/j.bbamcr.2010.12.004.

[6] J.M. Bower & H. Bolouri, editors (2001): Computational Modeling of Genetic and Biochemical Networks. The MIT Press.

[7] I.C. Chou & E.O. Voit (2009): Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Mathematical Biosciences 219(2), pp. 57-83, doi:10.1016/j.mbs.2009.03.002.

[8] S. Colombo, S. Broggi, L. D'Alfonso, M. Collini, G. Chirico & E. Martegani (2011): Monitoring cyclic-AMP changes in a single yeast cell. FEBS Journal 278 (Suppl. 1), doi:10.1111/j.1742-4658.2011.08137.x. P21.37.

[9] J. Decraene & T. Hinze (2010): A multidisciplinary survey of computational techniques for the modelling, simulation and analysis of biochemical networks. Journal of Universal Computer Science 16(9), pp. 1152-1175, doi:10.3217/jucs-016-09-1152.

[10] L. Dematte` & D. Prandi (2010): GPU computing for systems biology. Briefings in Bioinformatics 11(3), pp. 323-33, doi:10.1093/bib/bbq006.

[11] C. Garmendia-Torres, A. Goldbeter & M. Jacquet (2007): Nucleocytoplasmic oscillations of the yeast transcription factor Msn2: Evidence for periodic PKA activation. Current Biology 17(12), pp. 1044-9, doi:10.1016/j.cub.2007.05.032.

[12] D. Gilbert, H. Fuss, X. Gu, R. Orton, S. Robinson, V. Vyshemirsky, M.J. Kurth, C.S. Downes & W. Dubitzky (2006): Computational methodologies for modelling, analysis and simulation of signalling networks. Briefings in Bioinformatics 7(4), pp. 339-353, doi:10.1093/bib/bbl043.

[13] D.T. Gillespie (1977): Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry 81(25), pp. 2340-2361, doi:10.1021/j100540a008.

[14] M.J. Harvey & G. De Fabritiis (2012): A survey of computational molecular science using graphics processing units. Wiley Interdisciplinary Reviews: Computational Molecular Science 2(5), pp. 734-742, doi:10.1002/wcms.1101.

[15] E. Klipp, W. Liebermeister, C. Wierling, A. Kowald, H. Lehrach & R. Herwig (2009): Systems Biology: A Textbook. Wiley.

[16] A. Kolodkin, E. Simeonidis & H.V. Westerhoff (2013): Computing life: Add logos to biol ogy and bios to physics. Progress in Biophysics and Molecular Biology 111(2-3), pp. 69-74, doi:10.1016/j.pbiomolbio.2012.10.003.

[17] O. Medvedik, D.W. Lamming, K.D. Kim & D.A. Sinclair (2007): MSN2 and MSN4 link calorie restriction and TOR to Sirtuin-mediated lifespan extension in Saccharomyces cerevisiae. PLOS Biology 5(10), pp. 2330-2341, doi:10.1371/journal.pbio.0050261.

[18] A. Mogilner, R. Wollman & W.F. Marshall (2006): Quantitative modeling in Cell Biology: What is it good for? Developmental Cell 11(3), doi:10.1016/j.devcel.2006.08.004.

[19] G.L. Moldovan, B. Pfander & S. Jentsch (2007): PCNA, the maestro of the replication fork. Cell 129(4), pp. 665-679, doi:10.1016/j.cell.2007.05.003.

[20] M.S. Nobile, D. Besozzi, P. Cazzaniga, G. Mauri & D. Pescini (2013): cupSODA: a CUDA-powered simulator of mass-action kinetics. In: Proceedings of 12th International Conference on Parallel Computing Technologies, LNCS 7979. In press.

[21] M.S. Nobile, P. Cazzaniga, D. Besozzi, D. Pescini & G. Mauri (2013): Massive parallel tau-leaping stochastic simulations on GPUs for the analysis of biological systems. Manuscript in preparation.

[22] J.L. Payne, N.A. Sinnott-Armstrong & J.H. Moore (2010): Exploiting graphics processing units for computational biology and bioinformatics. Interdisciplinary Sciences, Computational Life Sciences 2(3), pp. 213-20, doi:10.1007/s12539-010-0002-4.

[23] D. Pescini, P. Cazzaniga, D. Besozzi, G. Mauri, L. Amigoni, S. Colombo & E. Martegani (2012): Simulation of the Ras/cAMP/PKA pathway in budding yeast highlights the establishment of stable oscillatory states. Biotechnology Advances 30(1), pp. 99-107, doi:10.1016/j.biotechadv.2011.06.014.

[24] R.P. Sinha & D.-P. Hader (2002): UV-induced DNA damage and repair: a review. Photochemical & Photobiological Sciences 1, pp. 225-236, doi:10.1039/B201230H.

[25] J. Stelling (2004): Mathematical models in microbial systems biology. Current Opinion in Microbiology 7(5), pp. 513-518, doi:10.1016/j.mib.2004.08.004.

[26] M.P.H. Stumpf, D.J. Balding & M. Girolami, editors (2011): Handbook of Statistical Systems Biology. John Wiley & Sons, Ltd.

[27] Z. Szallasi, J. Stelling & V. Periwal (2006): Systems Modeling in Cellular Biology. The MIT Press.

[28] J.M. Thevelein (1994): Signal transduction in yeast. Yeast 10(13), pp. 1753-1790, doi:10.1002/yea.320101308.

[29] D. Wilkinson (2009): Stochastic modelling for quantitative description of heterogeneous biological systems. Nature Reviews Genetics 10, pp. 122-133, doi:10.1038/nrg2509.

[30] D.J. Wilkinson (2006): Stochastic Modelling for Systems Biology. Chapman & Hall.

[31] O. Wolkenhauer, M. Ullah, W. Kolch & C. Kwang-Hyun (2004): Modeling and simulation of intracellular dynamics: choosing an appropriate framework. IEEE Transactions on NanoBioscience 3(3), pp. 200-7, doi:10.1109/TNB.2004.833694.

[32] W. Zhang, Z. Qin, X. Zhang & W. Xiao (2011): Roles of sequential ubiquitination of PCNA in DNA-damage tolerance. FEBS Letters 585(18), pp. 2786-94, doi:10.1016/j.febslet.2011.04.044.

Computational Methods for Metabolic Networks

Juho Rousu (Department of Information and Computer Science, Aalto University, Finland)

Metabolic networks provide a rich source of computational problems and applications including metabolic reconstruction, pathway search and flux analysis. The main thrust in the research field has so far been on single species models with metabolites appearing as atomic entities. In this talk I will discuss two less explored viewpoints on metabolic network models, namely atom-level representations, and simultaneous reconstruction of networks for multiple species. I will look at the computational challenges, algorithms, and modelling benefits of these viewpoints and present a case study using our new CoReCo framework that implements these viewpoints.

Rule-Based Modeling of Polymerization Reactions

Thilo Krüger (Department of Computer Science, Saarland University, Saarbrücken, Germany)
Verena Wolf (Department of Computer Science, Saarland University, Saarbrücken, Germany)

The modeling of polymerization reactions is a field of great interest. Since polymer chemistry deals with synthetic macromolecules, which are the raw materials of plastics, polymerizations are important for the area of materials science. By using models of polymerizations it is possible to optimize industrial processes in two ways, optimization of the process itself (utilization/yield) and therefore the reduction of production cost and optimization of the properties of the technical products, such as the stiffness of plastics, the porosity of foams or the adhesive power of adhesives. The most common approach to model polymerization reactions is based on ordinary differential equations (see [6] and the references therein).

Certain aspects of polymerization reactions, however remain hidden in ODE models. Most of these aspects concern the microstructure of the products of the reactions. For example, it is difficult to model the exact monomer sequence in copolymer products. Another example for such a microstructural issue is the distribution of branching points and other network parameters in polymerizations with branching reactions. Polymerizations can be branching, if some monomers have more than two possible reaction sites. In the extreme case, the complete product of such a reaction can be a single molecule, representing a complex molecular network. Recent work simulates branching reactions using the Gillespie method [5]. The major problem these approaches are dealing with is the large number of possible successor states for most of the states. The presented solutions for this problem are based on grouping of the reactions [2, 8]. These solutions are specialized for the presented models and constitute no general approach to deal with the problem of large numbers of transitions. Nevertheless, more detailed network characteristics can be obtained in some of the stochastic models. A concrete example for data which could be obtained by such models but which were difficult to get via ODE models is the bivariate MW-LCB distribution [8].

In general there are many technical products, which contain branched polymers. A well-known example are epoxy resins [9], which we use here to illustrate the principle mechanisms of a possible branching polymer reaction. The first reaction is the formation of linear prepolymers (cf. Figure 1 in [9] for the structure of the prepolymer DGEBA). The reactive groups (sites) of the products of this reaction are epoxy-groups, which are highly reactive cyclic ethers with three ring atoms. The prepolymers undergo a curing process afterwards (cf. Figure 1 in [9] for examples of hardeners). Therefore, they might for example react with a diamine. Since each amine group can react twice, a diamine reacts like a molecule with four sites. This reaction leads to a polymer network, therefore obtaining the detailed network parameters by modeling via ODEs is difficult as mentioned above.

In order to combat the problems related to the huge number of possible transitions of a state, we consider stochastic simulation methods for complex biochemical reaction networks. Rule-based modeling approaches solve the complexity problem. A detailed analysis of this approach can be found in [1, 3, 10]. In the case of the example above the epoxy groups are reacting with the hydrogens of the amine groups, not depending on the chemical environment of the epoxy group. Therefore applying rule-based modeling will gain a huge reduction of the complexity.

In this work, our goal is to apply rule-based modeling to polymerization reactions. Since the existing tools for running simulations on rule-based models are specialized for applications from systems biology, several adaptions are necessary to use them for the simulation of polymerization reactions. The largest difference is that the modeled processes are macroscopic. Polymerizations happen in large reactors instead of small cells, where processes in systems biology occur. Therefore the number of participating molecules is much higher than in models of microscopic processes. To get representative results for microstructural parameters of the products, the simulations should run with large numbers of monomers and therefore the product should be represented in a compressed way. One possibility for this is to represent longer chains of monomers without active reaction sites via a single symbol Mx, with x as the number of the monomers M in the chain. Other differences are due to model details of polymerization reactions. Some rate constants for example were found to be dependent on the chain length of the reacting polymers [7]. In certain polymerization reactions the formation of molecular cycles is not possible [4]. Forbidding cycles and combining chains of the same monomer type can be modeled with existing tools, while chain-dependent transition rates cannot. One important goal of modeling polymerization reactions is the computation of distributions of quantities like numbers of molecules with a certain chain length or the number of branches dependent on the weight like in [8]. The existing tools do not directly offer such a functionality. These examples show that specialized software tools for the rule-based simulation of polymerization reactions are necessary.


[1] V. Danos, J. Feret, W. Fontana, R. Harmer & J. Krivine (2007): Rule-Based Modelling of Cellular Signalling. Lect Note Comput Sci 4703, pp. 17-41, doi:10.1007/978-3-540-74407-8 3.

[2] M. Drache, G. Schmidt-Naake, M. Buback & P. Vana (2005): Modeling RAFT polymerization kinetics via Monte Carlo methods: cumyl dithiobenzoate mediated methyl acrylate polymerization. Polymer 46, pp. 8483-8493, doi:10.1016/j.polymer.2004.11.117.

[3] J. R. Faeder, M. L. Blinov, B. Goldstein & W. S. Hlavacek (2005): Rule-Based Modeling of Biochemical Networks. Complexity 10, pp. 22-41, doi:10.1002/cplx.20074.

[4] P. J. Flory (1941): Molecular Size Distribution in Three Dimensional Polymers. I. Gelation. Journal of the American Chemical Society 63, pp. 3083-3090, doi:10.1021/ja01856a061.

[5] D. T. Gillespie (1976): A General Method for Numerically Simulating the Stochastic Time Evolution of Coupled Chemical Reactions. Journal of Computational Physics 22, pp. 403-434, doi:10.1016/0021-9991(76)90041-3.

[6] C. Kiparissides (1996): Polymerization Reactor Modeling: A Review of Recent Developments and Future Directions. Chemical Engineering Science 51, pp. 1637-1659, doi:10.1016/0009-2509(96)00024-3.

[7] H. K. Mahabadi (1985): Effects of Chain Length Dependence of Termination Rate Constant on the Kinetics of Free-Radical Polymerization. 1. Evaluation of an Analytical Expression Relating the Apparent Rate Constant of Termination to the Number-Average Degree of Polymerization. Macromolecules 18, pp. 1319-1324, doi:10.1021/ma00148a048.

[8] D. Meimaroglou, A. Krallis, V. Saliakas & C. Kiparissides (2007): Prediction of the Bivariate Molecular Weight - Long Chain Branching Distribution in Highly Branched Polymerization Systems Using Monte Carlo and Sectional Grid Methods. Macromolecules 40, pp. 2224-2234, doi:10.1021/ma0623439.

[9] S. G. Prolongo, G. del Rosario & A. Urena (2006): Comparative study on the adhesive properties of different epoxy resins. International Journal of Adhesion & Adhesives 26, pp. 125-132, doi:10.1016/j.ijadhadh.2005.02.004.

[10] M. W. Sneddon, J. R. Faeder & T. Emonet (2011): Efficient modeling, simulation and coarse-graining of biological complexity with NFsim. Nature Methods 8, pp. 177-185, doi:10.1038/nmeth.1546.

Robustness Analysis of Stochastic Systems

Luboš Brim (Systems biology laboratory at Faculty of Informatics, Masaryk University, Brno, Czech Republic)
Milan Češka (Systems biology laboratory at Faculty of Informatics, Masaryk University, Brno, Czech Republic)
Sven Dražan (Systems biology laboratory at Faculty of Informatics, Masaryk University, Brno, Czech Republic)
David Šafránek (Systems biology laboratory at Faculty of Informatics, Masaryk University, Brno, Czech Republic)

Models of biological phenomena occurring in low populations (molecules or cells) may be more precisely modeled stochastically rather than by ordinary differential equations (ODEs), this enables to capture effects of random noise or fluctuations leading to multiple stable states. Even under conditions of stochastic fluctuations and changing parameters some properties of models may be robustly satisfied, quantification and precise computation of such robustness is the main goal of our approach.

Properties of models can be expressed in temporal logics such as Continuous Stochastic Logic (CSL) [1, 2] and automatically verified using statistical model checking with extensive use of simulations or numerical methods based on uniformization and transient probability analysis [5]. Even though the former is sometimes considered more computationally feasible since the later makes use of large state space structures, uniformization can outperform statistical model-checking in cases where phenomena with very low probabilities are involved [6] or results are required with high precision.

Robustness, as generally defined by Kitano [4], is the ability of a system S to maintain its functionality a over a set of perturbations P of its internal or external parameters. RSa,P = ∫Pψ(p) DSa(p)dp. Formally it is the integral of the evaluation function DSa over all p ∈ P weighted by the probability of occurrence of each such perturbation ψ(p). DSa(p) can be computed as the quantitative validity of the temporal property a in p by above mentioned formal methods.

In the case of a stochastic system we can define functionality by means of a CSL formulae and understand perturbations as changing values of its numerical constants such as initial population numbers, kinetic rates or stoichiometric coefficients leading to different system dynamics. These numerical constants may not be known and are only guessed within several orders of magnitude, may be uncertain due to limited precision of measurements or do not have a single value because of inherent heterogeneity.

While for deterministic ODE models robustness computation can be carried out using trajectory simulation and property monitoring in different perturbation points, a stochastic model has instead of a single trajectory a time evolving state probability distribution which substantially increases computational complexity.

To be able to compute the robustness of a stochastic system S with respect to a CSL property a and a perturbation space P we must be able to estimate the quantitative validity/probability of a over P or subspaces of P. For discrete parameters this is straight forward although possibly demanding, however for continuous parameters such as kinetic rates only approximative sampling techniques were feasible up to now.

By using a modified version of uniformization [3] we have proposed a method to compute the upper and lower bounds of property validity over continuous parameter spaces. The main idea is to compute for each model state its probability as an upper and lower bound instead of a single number. This is possible by computing the minimum resp. maximum stochastic rate of each reaction with parameters being perturbed over the examined value interval. In case of more complex nested CSL properties minimum and maximum sets of states satisfying nested formulae have to be computed. Depending on the form of reaction kinetics min/max rates can be more or less precisely computed (mass action kinetics precisely, sigmoid hill kinetics approximately), together with local greedy computation the resulting under/over approximation is more or less exact and refinement may be needed to achieve requested precision.

This modified min/max uniformization together with a refinement procedure on the parameter space leads to an over/under approximation of the robustness with arbitrary precision accompanied by correspondingly increased computation complexity. For even small models, computation with a sensible precision can be very demanding in the order of hours, fortunately the algorithm can be straight-forwardly distributed.

A prototype implementation based upon the tool PRISM is used for dynamics analysis of the simple one-dimensional Schloegl's model [7] with pure mass action kinetics (Figure 1, 1000 states). A more elaborate four-dimensional model of a virtual signal transduction mechanism in cells is analyzed to show how two different variants of the model cope with increasing levels of intrinsic noise induced by gene transcription (116 000 states).

Figure 1: Schloegl's model: state probabilities show bistability; size of each stable mode is analyzed with respect to the perturbed kinetic parameter k1; computation with precision 1% (left) and 0.1% (right); upper bound on probability of each state in red, lower in green.


[1] Adnan Aziz, Kumud Sanwal, Vigyan Singhal & Robert Brayton (1996): Verifying Continuous Time Markov Chains. In Rajeev Alur & Thomas A. Henzinger, editors: CAV 2009, LNCS 1102, Springer, pp. 269-276, doi:10.1007/3-540-61474-5_75.

[2] C. Baier, B. Haverkort, H. Hermanns & J.P. Katoen (2003): Model-Checking Algorithms for Continuous-Time Markov Chains. IEEE Trans. on Soft. Eng. 29(6), pp. 524-541, doi:10.1109/TSE.2003.1205180.

[3] Luboš Brim, Milan Češka, Sven Dražan & David Šafránek (2013): Exploring Parameter Space of Stochastic Biochemical Systems using Quantitative Model Checking. In: to appear in CAV 2013.

[4] Hiroaki Kitano (2007): Towards a theory of biological robustness. Molecular Systems Biology 3(137), doi:10.1038/msb4100179.

[5] Marta Kwiatkowska, Gethin Norman & David Parker (2007): Stochastic Model Checking. In: Formal Methods for Performance Evaluation, LNCS 4486, Springer, pp. 220-270, doi:10.1007/978-3-540-72522-0_6.

[6] Brian Munsky & Mustafa Khammash (2006): The finite state projection algorithm for the solution of the chemical master equation. The Journal of Chemical Physics 124(4), p. 044104, doi:10.1063/1.2145882.

[7] F. Schlögl (1972): Chemical Reaction Models for Non-Equilibrium Phase Transitions. Zeitschrift fur Physik 253, pp. 147-161, doi:10.1007/BF01379769.

Mathematical modelling of the platelet-derived growth factor signalling pathway

Andrzej Mizera (University of Luxembourg, Luxembourg)
Jun Pang (University of Luxembourg, Luxembourg)
Thomas Sauter (University of Luxembourg, Luxembourg)
Panuwat Trairatphisan (University of Luxembourg, Luxembourg)

Constructing a mathematical/computational model for a biological system consists of two main steps: first, of specifying the model structure and, second, of determining the numerical values for the parameters of the model. Usually, the structure of the model is represented in the form of a biochemical reaction network and the parameters are the reaction rate constants. The values of the reaction rates can either be measured directly in experiments or determined by fitting the model to experimental data by performing parameter estimation. Since the former approach is often impeded by technical limitations or high costs of experimental practice, parameter fitting is a fundamental problem in systems biology research.

In this study we consider issues related to parameter estimation for a model of the platelet-derived growth factor (PDGF) signalling pathway with use of steady-state data and so called knock-down mutant models, i.e., variants of the original model obtained by suppressing one or more interactions in the model. Since the knock-down mutants of the real biological system can be obtained and investigated in experimental practice as well as the physical/chemical properties are usually common for all variants, the measurements of the mutants can enrich the set of data available for parameter estimation. We consider parameter estimation both in the deterministic (system of ordinary differential equations) and stochastic (continuous-time Markov chain) modelling frameworks. We discuss certain difficulties related to parameter estimation we encountered while modelling the PDGF signalling pathway and present our solutions.

Rule-based modelling as a platform for the analysis of synthetic self-assembled nano-systems

Abdulmelik Mohammed (Aalto University, Espoo, Finland)
Eugen Czeizler (Aalto University, Espoo, Finland)

Recent advances in DNA-based nano-technology have opened the way towards the systematic engineering of inexpensive, nucleic-acid based nano-scale devices for a multitude of purposes[4]. The field itself evolved dramatically in the past 10-15 years from a state where careful manual design and intimate knowledge of DNAs atomic structure were needed for the design of simple structures [6], to the current algorithmic approaches employing the use of universally engineered elementary building blocks that are further functionalized and driven to self-assemble into the desired shapes, see [2, 8]. The Tile Assembly Model (TAM), which is the theoretical design platform for the above modular assembly scheme, employs the use of so-called DNA-tiles, which can be seen as unit square blocks with active glues on their four (North, East, South and West) edges. These active glues, implemented using single-stranded DNA sticky-ends, are driving the self-assembly process and determine the controlled aggregation of the tiles into the desired structures.

Since its introduction [5], the TAM formalism has been used successfully both in designing complex assembly nano-structures and in providing predictions regarding the possible experimental outcomes of certain designs. To this extent, the Tile Assembly Model has been expanded to a kinetic counterpart, kTAM [9]. The kTAM incorporates two types of reactions: association of tiles to an assembly (forward reaction) and dissociation (reverse reaction), see e.g. Figure 1 a) in the PDF verison of this article. In a forward reaction, any tile can attach to the assembly at any position, even if only a weak bond is formed; the rate of this reaction, rf, is proportional to the concentration of free tiles in the solution. In the second reaction, any tile can detach from the assembly with rate rr,b, b∈{0...5}, which depends exponentially on the total strength of the bonds between the tile and the assembly. Thus, tiles which are connected to the assembly by fewer or weaker bonds are more prone to dissociation than those which are strongly connected. Previous com- putational modeling of kTAM has been performed almost exclusively using a special tailored version of the Gillespie's stochastic simulation algorithm. The assembly starts at t = 0 from a seed structure. Then, in discrete time steps, tiles are added or detached from the assembly, according to the corresponding association and dissociation rates.

In our work, we propose a rule-based modelling approach for predicting the time evolution of kTAM systems and their variants. Rule-based modelling is a discrete modeling technique [1] in which molecules are represented as agents with a finite number of free sites. The sites, which may have several internal states, allow for agent-agent biding, thus generating molecular complexes. Rules are defined based on local patterns rather than by the full specification of the reactants, and thus provide a compact representation on how agents interact. In this way, rather than handling explicitly a large number of model variables, we often only have a small number of local interaction rules. This makes the rule-based paradigm well suited in taming the problem of the combinatorial explosion of the state space, as is the case of modelling self-assembly or polymerization systems, see e.g., [7].

In the modelling of DNA tile systems, the primary agent, Tile, has a number of sites, including Nedge, Eedge, Sedge, and Wedge, each with internal states corresponding to possible glues of these tiles. Then, both tile-association and tile-dissociation reactions can be implemented by appropriate local rules. For example, the addition of a tile to a free North site, as depicted in Figure 1 b) in the PDF verison of this article, can be implemented by the local rule,

Tile(Nedge,in~1) + Tile(Sedge, in~0) → Tile(Nedge!1,in~1).Tile(Sedge!1,in~1), kon

where in is a specific site which is in state 1 if the tile is inside an assembly and 0 otherwise. The decoding of the above rule is as follows: a Tile with an unbounded Nedge site placed inside the assembly (i.e., in~1) interacts with a free Tile (i.e., in~0) with an unbounded Sedge site, and the two become bonded on the sites Nedge and Sedge; the reaction has a kinetic rate constant kon. Tile dissociation reactions, which are dependent on the total bond strength of the tiles, can be implemented in a similar manner.

There are a number of advantages in using a rule-based modelling approach for kTAM. Since this is a coarse-grain modelling framework, it allows the examination of a very diverse family of observables. Thus, the system can be analyzed extensively and both final and intermediate states can be inquired with detailed precision. Also, due to the current availability of appropriate software frameworks, numerical simulations of rule-based models are simple to run. Such simulations can be written in pseudocode, using e.g. BNGL [1] or κ [3]. Thus, the emphasis is placed on describing the system's reaction rules, and not in dealing with the numerical simulation algorithm. Hence, custom simulation are easy to create, update, and modify.

A major drawback of previous modelling approaches was that they could express only those systems where there exists a single growing assembly, and all the reactions (addition and dissociation) were between this unique assembly and the free floating tiles. This situation however, does not cover the case where two partial assemblies, each consisting of more than one tile, are interacting. However, using the rule-based modelling framework one can implement such reactions too. This opens the possibility of modelling various variants of TAM (which are closer to experimental implementations), accounting for both assembly-tile interactions and assembly-assembly interactions, e.g., staged or hierarchical tile assembly models.

In our study, we first created a rule-based model of kTAM. Using the same kinetic parameters as in previous numerical implementations of kTAM, we compared the prediction of those models (regarding time-growths and error-fraction of the assembly) with that of our model, see e.g. Figure 2 in the PDF verison of this article. Then, we introduced some small variations into the model and studied how does this affect the overall behaviour of the system. Also, we implemented a rule-based model for staged TAM. Using numerical simulations, we analyzed the possible shortcomings of the combinatorial design in the case of possible experimental implementations.


[1] J. Faeder, M. Blinov & W. Hlavacek (2009): Rule-Based Modeling of Biochemical Systems with BioNetGen. In: Systems Biology, 500, Humana Press, pp. 113-167, doi:10.1007/978-1-59745-525-1 5.

[2] K. Fujibayashi, R. Hariadi, S. Park, E. Winfree & S. Murata (2008): Toward reliable algorithmic self-assembly of DNA tiles: a fixed-width cellular automaton pattern. Nano Lett 8(7), pp. 1791-7.

[3] W. Hlavacek, J. Faeder, M. Blinov, R. Posner, M. Hucka & W. Fontana (2006): Rules for Modeling Signal-Transduction Systems. Sciences STKE.

[4] Y. Krishnan & F. Simmel (2011): Nucleic Acid based molecular devices. Angew Chem Int Ed Engl 50(14), pp. 3124-56.

[5] P. Rothemund & E. Winfree (2000): The program-size complexity of self-assembled squares. In: Proc. 32nd Annual ACM Symp. on Theory of Computing, ACM, pp. 459-468, doi:10.1145/335305.335358.

[6] N. Seeman (1982): Nucleic acid junctions and lattices. Journal of Theoretical Biology 99(2), pp. 237-247, doi:10.1016/0022-5193(82)90002-9.

[7] M. Sneddon, J. Faeder & T. Emonet (2011): Efficient modeling, simulation and coarse-graining of biological complexity with NFsim. Nature methods 8(2), pp. 177-183, doi:10.1038/nmeth.1546.

[8] E. Winfree, F. Liu, L. Wenzler & N. Seeman (1998): Design and self-assembly of two-dimensional DNA crystals. Nature 394(6693), pp. 539-544, doi:10.1038/28998.

[9] Erik Winfree (1998): Simulations of Computing by Self-Assembly. Technical Report CSTR 1998.22, California Institute of Technology.

An Algebraic Approach to Gene Assembly in Ciliates

Robert Brijder (Hasselt University and Transnational University of Limburg, Belgium)

Gene assembly is an intricate process occurring in unicellular organisms called ciliates. During this process a nucleus, called the micronucleus, is transformed into a functionally and structurally different nucleus called the macronucleus. It is accomplished using involved DNA splicing and recombination operations [10, 11]. Gene assembly has been formally studied on the level of individual genes, see, e.g., the monograph [8] and a recent survey [4].

The theory of Euler circuits in 4-regular graphs was initiated in a seminal paper by Kotzig [9]. Bouchet further developed the theory by relating it to delta-matroids [2] and isotropic systems [1, 3]. In [2], Bouchet uses a matrix transformation that turns out to be "almost" principal pivot transform [12]. Principal pivot transform, delta-matroids, and isotropic systems enjoy many interesting properties which have direct consequences for the theory of Euler circuits in 4-regular graphs.

Although, at first glance, the formal theory of gene assembly seems to be related to the theory of Euler circuits in 4-regular graphs, there have been little attempts to fit the former theory into the latter. In this talk we do exactly this. We show that the formal model of gene assembly can be defined quite efficiently in terms of 4-regular graphs. We discuss consequences of known results in the theory of 4- regular graphs (including, e.g., results related to principal pivot transform and delta-matroids) for the theory of gene assembly. This talk is based on the articles [6, 5, 7].


[1] A. Bouchet (1987): Isotropic systems. European Journal of Combinatorics 8, pp. 231-244.

[2] A. Bouchet (1987): Representability of Δ-matroids. In: Proc. 6th Hungarian Colloquium of Combinatorics, Colloquia Mathematica Societatis János Bolyai, 52, North-Holland, pp. 167-182.

[3] A. Bouchet (1988): Graphic presentations of isotropic systems. Journal of Combinatorial Theory, Series B 45(1), pp. 58-76, doi:10.1016/0095-8956(88)90055-X.

[4] R. Brijder, M. Daley, T. Harju, N. Jonoska, I. Petre & G. Rozenberg (2012): Computational Nature of Gene Assembly in Ciliates. In G. Rozenberg, T.H.W. Bäck & J.N. Kok, editors: Handbook of Natural Computing, chapter 37, 3, Springer, pp. 1233-1280, doi:10.1007/978-3-540-92910-9_37.

[5] R. Brijder, T. Harju & H.J. Hoogeboom (2012): Pivots, Determinants, and Perfect Matchings of Graphs. Theoretical Computer Science 454, pp. 64-71, doi:10.1016/j.tcs.2012.02.031.

[6] R. Brijder & H.J. Hoogeboom (2010): Maximal Pivots on Graphs with an Application to Gene Assembly. Discrete Applied Mathematics 158(18), pp. 1977-1985, doi:10.1016/j.dam.2010.08.030.

[7] R. Brijder & H.J. Hoogeboom (2013): The Algebra of Gene Assembly in Ciliates. In N. Jonoska & M. Saito, editors: Discrete and Topological Models in Molecular Biology. To appear.

[8] A. Ehrenfeucht, T. Harju, I. Petre, D.M. Prescott & G. Rozenberg (2004): Computation in Living Cells - Gene Assembly in Ciliates. Springer Verlag.

[9] A. Kotzig (1968): Eulerian lines in finite 4-valent graphs and their transformations. In: Theory of graphs, Proceedings of the Colloquium, Tihany, Hungary, 1966, Academic Press, New York, pp. 219-230.

[10] D.M. Prescott (2000): Genome gymnastics: Unique modes of DNA evolution and processing in ciliates. Nature Reviews 1, pp. 191-199, doi:10.1038/35042057.

[11] D.M. Prescott, A. Ehrenfeucht & G. Rozenberg (2001): Molecular operations for DNA processing in hy- potrichous ciliates. European Journal of Protistology 37, pp. 241-260, doi:10.1078/0932-4739-00807.

[12] M.J. Tsatsomeros (2000): Principal pivot transforms: properties and applications. Linear Algebra and its Applications 307(1-3), pp. 151-165, doi:10.1016/S0024-3795(99)00281-5.

Large-Scale Text Mining of Biomedical Literature

Filip Ginter (Department of Information Technology, University of Turku, Turku, Finland)

Natural Language Processing in the Biomedical Domain (BioNLP) deals with text mining of scientific biomedical literature, most typically represented by the PubMed abstracts, and less frequently also the Open Access subset of PubMed Central full article texts. The methods of BioNLP are many, and address a wide variety of tasks relevant to information search and indexing, as well as hypothesis generation.

Recently, the particular problem of event extraction has become the focus of a community-wide effort, due to the series of BioNLP Shared Tasks on Event Extraction [2]. Events, as defined within the Shared Tasks, are an expressive, formal representation of statements in the underlying text. An event is a structure with a type from a pre-defined type vocabulary (such as Regulation and Protein catabolism), and arguments with a role, typically Cause or Theme. An important property of events is that their arguments are not only physical entities such as genes and proteins, but also other events, thus forming recursively nested structures. An example event is shown in Figure 1 (upper half); see the PDF version of this article.

The BioNLP Shared Tasks led to a rapid progress in the development of advanced event extraction systems. Even though much effort was invested in their development, only few of these systems were made publicly available and deployed in a real-world setting. We have applied the Turku Event Extraction System [1] to the entire PubMed (22M abstracts) and PubMed Central Open Access subset (460K full text articles), creating a large-scale dataset consisting of 40M biomolecular events among 76M gene and protein text mentions [4].

Event extraction systems treat gene and protein mentions as simple strings. Such approach does not allow an easy integration with other resources due to the inherent ambiguity of these mentions, whereby one symbol can refer to any number of genes/proteins across many species and conversely a single gene/protein can be referred to by a number of synonymous symbols. In order to be able to utilize the extracted events, we carried out a gene normalization step [5], assigning the gene/protein mentions in text to their Entrez Gene identifier. The events can consequently be aggregated across different documents and, more importantly, related to the wealth of other databases with information about the genes and proteins involved. Finally, through homology-based gene family definitions, such as HomoloGene [3], events can be aggregated over entire gene families allowing for homology-based hypothesis generation. The gene normalization and gene family assignment steps are illustrated in Figure 1 (lower half); see the PDF version of this article.

While the recursive nesting of events allows a more faithful formal representation of the underlying textual statements, the events prove difficult to manipulate and integrate with other resources. Therefore, we also introduce a network view which re-interprets the complex event structures into a network form with genes and proteins as nodes and events as edges. This network can be easily integrated with other resources such as co-expression assays, regulatory network predictions, and known interaction databases.

The resulting resource, EVEX [4], can be accessed using a web interface as well as downloaded under an open license.

Acknowledgment: This work was carried out in a close collaboration with Dr. Sofie Van Landeghem, VIB/Ghent University, Belgium.


[1] Jari Björne & Tapio Salakoski (2011): Generalizing biomedical event extraction. In: Proceedings of BioNLP Shared Task 2011, Association for Computational Linguistics, pp. 183-191. Available at

[2] Jin-Dong Kim, Ngan Nguyen, Yue Wang, Jun'Ichi Tsujii, Toshihisa Takagi & Akinori Yonezawa (2012): The Genia Event and Protein Coreference tasks of the BioNLP Shared Task 2011. BMC Bioinformatics13(Suppl 11), p. S1, doi:10.1186/1471-2105-13-S11-S1.

[3] Eric W. Sayers et al. (2012): Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 40(D1), pp. D13-D25, doi:10.1093/nar/gkr1184.

[4] Sofie Van Landeghem, Jari Björne, Chih-Hsuan Wei, Kai Hakala, Sampo Pyysalo, Sophia Ananiadou, Hung-Yu Kao, Zhiyong Lu, Tapio Salakoski, Yves Van de Peer & Filip Ginter (2013): Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization. PLoS ONE 8(4), p. e55814, doi:10.1371/journal.pone.0055814.

[5] Chih-Hsuan Wei & Hung-Yu Kao (2011): Cross-species gene normalization by species inference. BMC bioinformatics 12(Suppl 8), p. S5, doi:10.1186/1471-2105-12-S8-S5.