2020 Science

From Datafedwiki

2020 Science

Part 1: Laying the Ground

Computational Science

Scientific computing platforms and infrastructures are making possible new kinds of experiments that would have been impossible to conduct only 10 years ago, changing the way scientists do science.

Data management is a major issue. It is necessary to merge the capabilities of a file system to store and transmit bulk data from experiments, with logical organisation of files into indexed data collections, allowing efficient query and analytical operations. It is also necessary to incorporate extensive metadata describing each experiment and the data it produced. Rather than flat files traditionally used in scientific data processing, the full power of relational databases is needed to allow effective interactions with the data, and an interface which can be exploited by the extensive scientific toolkits available, for purposes such as visualisation and plotting.

Disciplines require support for diverse types of tasks. Astronomy, for example, has far more emphasis on the collation and curation of federated datasets held at disparate sites. In the life sciences, the problems are far more related to heterogeneous, dispersed data rather than computation. The harder problem for the future is heterogeneity, of platforms, data and applications, rather than simply the scale of the deployed resources. The goal should be to allow scientists to ‘look at’ the data easily, wherever it may be, with sufficient processing power for any desired algorithm to process it.

Next Decade: Multi-Core CPU. We postulate that most aspects of computing will see exponential growth in bandwidth but sub-linear or no improvements at all in latency. Moore’s Law will continue to deliver exponential increases in memory size, but the speed with which data can be transferred between memory and CPUs will remain more or less constant, and marginal improvements can only be made through advances in caching technology. Likewise, Moore’s Law will allow the creation of parallel computing capabilities on single chips by packing multiple CPU cores onto each chip.

Networking bandwidth will continue to grow exponentially, but we are approaching the speed of light as a floor for the latency of network packet delivery. We will continue to see exponential growth in disk capacity, but the speed with which disks rotate and heads move, factors which determine the latency of data transfer, will grow sub-linearly at best, or more likely remain constant.

From an application development point of view, this will require a fundamental paradigm shift from the currently prevailing sequential or parallel programming approach in scientific applications to a mix of parallel and distributed programming. Such programs exploit low latency in multi-core CPUs but are explicitly designed to cope with high latency whenever the task at hand requires more computational resources than a single machine can provide.

Lack of further improvement in network latency means that the currently prevailing synchronous approach to distributed programming, for example, using remote procedure call primitives, will have to be replaced with a fundamentally more delay-tolerant and failure-resilient asynchronous programming approach.
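
The contrast between synchronous calls and a delay-tolerant, failure-resilient asynchronous style can be sketched in a few lines of Python using the standard asyncio module. This is only an illustration, not something taken from the report: the service names, the simulated delays and the `call_service` helper are all hypothetical stand-ins for real remote calls.

```python
import asyncio

async def call_service(name: str, delay: float, fail: bool = False) -> str:
    """Hypothetical stand-in for a remote call; `delay` simulates latency."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} unreachable")
    return f"{name}: ok"

async def gather_results(timeout: float = 0.2) -> list:
    # All requests are issued at once, so the total waiting time is bounded
    # by the slowest call (or the timeout), not by the sum of the latencies.
    calls = [
        call_service("catalogue", 0.01),
        call_service("archive", 0.02, fail=True),   # simulated failure
        call_service("simulation", 1.0),            # simulated slow site
    ]
    tasks = [asyncio.wait_for(c, timeout) for c in calls]
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)

    report = []
    for outcome in outcomes:
        if isinstance(outcome, asyncio.TimeoutError):
            report.append("timed out")
        elif isinstance(outcome, Exception):
            report.append(f"failed: {outcome}")
        else:
            report.append(outcome)
    return report

results = asyncio.run(gather_results())
```

In this sketch a slow or failed site shows up as an entry in the report rather than blocking or aborting the other requests, which is the property the asynchronous approach is meant to provide.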

Next Decade: Peer-to-Peer and Service Oriented Architectures. A first step towards such asynchronous, delay-tolerant programming is the emergence of peer-to-peer and service-oriented architectures, which support reuse of both functionality and data in cross-organisational distributed computing settings.

Peer-to-peer (P2P) architectures support the construction of distributed systems without any centralised control or hierarchical organisation [5]. These architectures have been successfully used to support file sharing, most notably of multimedia files. We expect that computational science applications will increasingly use P2P architectures and protocols to achieve scalable and reliable location and exchange of scientific data and software in a decentralised manner.

While P2P systems support reuse of data, the paradigm of service-oriented architectures (SOA) and the web-service infrastructures [6] that assist in their implementation facilitate reuse of functionality.

In order to take advantage of distributed computing resources in a grid, scientists will increasingly also have to reuse code, interface definitions, data schemas and the distributed computing middleware required to interact in a cluster or grid.

The fundamental primitive that SOA infrastructures provide is the ability to locate and invoke a service across machine and organisational boundaries, both in a synchronous and an asynchronous manner. The implementation of a service can be achieved by wrapping legacy scientific application code and resource schedulers, which allows for a viable migration path.

Computational scientists will be able to flexibly orchestrate these services into computational workflows. The standards available for service orchestration [7] and their implementation in industry-strength products support the rapid definition and execution of scientific workflows [8].

Semantics of Data

A revolution is taking place in the scientific method. “Hypothesize, design and run experiment, analyze results” is being replaced by “hypothesize, look up answer in database” [9]. Databases are an essential part of the infrastructure of science. They may contain raw data, the results of computational analyses or simulations, or the product of annotation and organisation of data.

The development of an infrastructure for scientific data management is therefore essential. This poses major challenges for both database and programming language research, which differ from the conventional (business) requirements of databases.

A major issue is the distribution of data. Database technology has recognised for a long time that it is expensive or impossible to move large quantities of data. Instead one moves the code (software executing a program) to the data, and this is the core of distributed query optimisation.
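
A minimal sketch of moving the code to the data, assuming a hypothetical `DataNode` class and an illustrative (station, temperature) row layout that do not appear in the report: each site filters and aggregates its own rows, and only the small partial result travels back to be combined.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataNode:
    """Hypothetical site that holds its own rows and runs code sent to it."""
    name: str
    rows: list  # each row is (station, temperature), an illustrative schema

    def run(self, task: Callable[[list], tuple]) -> tuple:
        # The task executes where the data lives; only its small
        # result would cross the network.
        return task(self.rows)

def partial_mean(rows):
    """Filter and aggregate locally; return (sum, count), not the rows."""
    warm = [t for _, t in rows if t > 0.0]
    return (sum(warm), len(warm))

nodes = [
    DataNode("site_a", [("s1", 4.0), ("s2", -1.0), ("s3", 6.0)]),
    DataNode("site_b", [("s4", 10.0), ("s5", 0.0)]),
]

# Combine the per-site partial results into one global answer.
partials = [node.run(partial_mean) for node in nodes]
total, count = map(sum, zip(*partials))
global_mean = total / count
```

The design choice here is that the aggregate is decomposable: a mean can be rebuilt from per-site sums and counts, so no site ever has to ship its raw rows.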

Note: turn data into services; core services (filter, aggregate, fuse) are needed alongside the SQL operations (select, where, join, group by, order ...).

Second, we need to extend distributed query optimisation, which works for the simple operations of relational algebra, to work for more general operations that support scientific programming and to include, for example, spatial queries, string searches, etc. Known database techniques, such as parallel processing, set-oriented data access and intelligent indexing, need to be extended, where possible, to support scientific data types. Third, we are facing much greater heterogeneity: individual data or document pieces require specific remote evaluation.

But this is just the base technology that has to be developed. It must be supported by a computing environment in which it is easy for scientists to exploit the infrastructure. First and foremost is the semantics of data. This involves an understanding of the metadata, the quality of the data, where and how it was produced, intellectual property, etc. This ‘data about data’ is not simply for human consumption; it is primarily used by tools that perform data integration and exploit web services that, for instance, transform the data or compute new derived data. Furthermore, the environment should facilitate standard tasks such as querying, programming, mining or task orchestration (workflow), and it should make it possible for scientists to generate their own computing tasks, rather than being reliant on database experts.

We believe that attempts to solve the issues of scientific data management by building large, centralised, archival repositories are both dangerous and unworkable. They are dangerous because the construction of a data collection or the survival of one’s data is at the mercy of a specific administrative or financial structure; unworkable because of scale, and also because scientists naturally favour autonomy and wish to keep control over their information. When it is necessary to bring large quantities of data together for centralised computing, this should be done by replication, appropriate restructuring and semantic integration when necessary.

With this move towards reliance on highly distributed and highly derived data, there is a largely unsolved problem of preserving the scientific record. There are frequent complaints that by placing data on the web (as opposed to conventional publications or centralised database approaches), essential information has been lost. How do we record the details of the highly complex process by which a data set was derived? How do we preserve the history of a data set that changes all the time? How do we find the origin of data that has been repeatedly copied between data sources? Such issues have to be resolved to offer a convincing infrastructure for scientific data management.
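
One possible way to approach these questions, offered only as a sketch and not as a method proposed in the report, is to attach a small provenance record to every derived data set. In the Python below the record layout, the step names and the `derive` helper are assumptions made for illustration; content hashes let a copy be matched to its origin, and each record points at the records of its inputs.

```python
import hashlib
import json

def fingerprint(data: bytes) -> str:
    """Content hash, so a copy can be matched to its origin."""
    return hashlib.sha256(data).hexdigest()

def derive(inputs: list, step: str, params: dict, output: bytes) -> dict:
    """Build a provenance record linking an output to its inputs."""
    return {
        "output": fingerprint(output),
        "step": step,        # name of the processing step (illustrative)
        "params": params,    # settings used for the step
        "inputs": [rec["output"] for rec in inputs],
    }

raw = b"1.0,2.0,3.0"
raw_rec = derive([], "acquire", {"instrument": "hypothetical-sensor"}, raw)

scaled = b"2.0,4.0,6.0"
scaled_rec = derive([raw_rec], "scale", {"factor": 2}, scaled)

# The records can be stored next to the data as plain JSON.
history = json.dumps([raw_rec, scaled_rec], indent=2)
```

Following the `inputs` links backwards from any record reconstructs the chain of steps that produced it, provided every step writes such a record.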

Finally, we note that the future of databases in science is as much a social as a technical issue. Scientific funding organisations are increasingly requiring researchers to publish their data. But it is important that there are agreed community standards for publishing metadata, citations and provenance. Only if we have these will the data we are generating today be usable by applications of the future.

Intelligent Interaction and Information Discovery

A significant change in scientists’ ability to analyse data to obtain a better understanding of natural phenomena will be enabled by

  • (i) new ways to manage massive amounts of data from observations and scientific simulations
  • (ii) integration of powerful analysis tools directly into the database
  • (iii) improved forms of scientist-computer-data interaction that support visualisation and interactivity
  • (iv) active data, notification, and workflows to enhance the multi-stage data analysis among scientists distributed around the globe, and
  • (v) transformation of scientific communication and publishing.

Managing Data Explosion. Scientific data are increasing exponentially. Scientists have difficulty in keeping up with this ‘data deluge’ [10]. It is increasingly clear that, as a consequence, the way scientists interact with the data and with one another is undergoing a fundamental paradigm shift. The traditional sequence of ‘experiment › analysis › publication’ is changing to ‘experiment › data organisation › analysis › publication’ as more and more scientific data are ingested directly into databases, even before the data are analysed (see also section ‘Transforming Scientific Communication’). Today, data are generated not only by experiments but also by large numerical simulations. The challenge is to extract information and insights from the data without being hindered by the task of managing it.

Adaptive organisation and placement of data. Since network speeds to most academic locations are not keeping up with the size of and demand for data, in many cases scientists will not be able to copy data to their own machines; the analysis needs to be run closer to the data. As a result, data archives will have to offer access to analysis tools (and computational resources) and provide some ‘private’ workspace. All this will allow for laboratory- and discipline-spanning collaboration while also helping to curb the exploding network traffic.

Data stores will need to be extensible so that they can absorb the software packages containing the algorithms for data analysis required by scientists. Better divide-and-conquer techniques are needed to help break through the polynomial complexity of existing algorithms, and better distributed and loosely coupled techniques (e.g. Web services) are required in order to distribute, exchange and share results among expert scientific communities. Most scientists will only look at a small part of the available data. If this ‘hot’ data is mirrored at several locations, and this hierarchical process is repeated at several levels, one can have a system where both the I/O and the computational load are much better distributed. As a result, large databases will be complemented by a federated hierarchy of smaller, specialised databases.

Move from batch-oriented to interactive. Data-centric workflow systems are necessary to move from a batch-oriented approach to an interactive one in which scientists can control the processing based on visualisations and real-time analysis.

Tools for data analysis. The demand for tools and computational resources to perform scientific data analysis is rising even faster than data volumes:

  • more sophisticated algorithms consume more instructions to analyse each byte;
  • many analysis algorithms are polynomial, often needing N² or N³ time to process N data points; and
  • I/O bandwidth has not kept pace with storage capacity.

In the last decade, while capacity has grown more than 100-fold, storage bandwidth has improved only about 10-fold.
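
The arithmetic behind these points can be made explicit with a small Python sketch. The two growth factors are the ones quoted above; the assumption that the data set grows in step with storage capacity is ours, added only for illustration.

```python
capacity_growth = 100    # "more than 100-fold" (from the text)
bandwidth_growth = 10    # "about 10-fold" (from the text)

# Time to read a full store once is capacity / bandwidth, so it
# grows by the ratio of the two factors.
scan_time_growth = capacity_growth / bandwidth_growth

def work_growth(data_growth: float, exponent: int) -> float:
    """Growth in operations for an algorithm needing N**exponent time."""
    return data_growth ** exponent

# Assumption: the number of data points N grows with capacity.
quadratic = work_growth(capacity_growth, 2)
cubic = work_growth(capacity_growth, 3)
```

Under these assumptions a full scan takes about ten times longer than a decade earlier, while a quadratic algorithm needs 10,000 times more operations and a cubic one 1,000,000 times more.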

Integrated symbolic computation, data mining and analysis. Data mining algorithms allow scientists to automatically extract valid, authentic and actionable patterns, trends and knowledge from large data sets. Data mining algorithms such as automatic decision tree classifiers, data clusters, Bayesian predictions, association discovery, sequence clustering, time series, neural networks, logistic regression, and linear regression integrated directly in database engines will increase the scientist’s ability to discover interesting patterns in their observations and experiments.

Data cubes, data visualisation and rapid application development. Large observational data sets, the results of massive numerical computations, and high-dimensional theoretical work all share one need: visualisation. Observational data sets such as astronomical surveys, seismic sensor output, tectonic drift data, ephemeris data, protein shapes, and so on, are infeasible to comprehend without exploiting the human visual system.

Similarly, finite-element simulations, thunderstorm simulations, solid-state physics, many-body problems, and many others depend on visualisation for interpretation of results and feedback into hypothesis formation.

Many scientists, when faced with large amounts of data, want to create multidimensional aggregations, where they can experiment with various correlations between the measured and derived quantities. Much of this work today is done through files, using home-brew codes or simple spreadsheets. Most scientists are not even aware that tools like Online Analytical Processing (OLAP) data cubes are available as add-ons to the database engines. Smart data cubes play a twofold role. First, they serve as caches or replicas of pre-computed, multi-dimensional aggregations that facilitate data analysis from multiple perspectives. Second, they support the visualisation of data over data partitions. Given the deluge of data scientists need to deal with, we also need to use data mining techniques to facilitate automatic detection of interesting patterns in the data.

An important way for database technology to aid the process is first through transformation of schematised large-scale science data into schematised small-scale formats, then through transformation of small-scale formats into standardised graphical data structures such as meshes, textures and voxels. The first kind of transformation fits into the category of OLAP, which is a staple of the business community. The second kind of transformation is an exciting area for applied R&D.
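
To make the idea of a data cube concrete, here is a minimal Python sketch that pre-computes sums for every combination of dimensions. It is an illustration only: real OLAP engines use far more efficient storage and computation, and the `build_cube` function and the observation fields (site, year, band, flux) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dims, measure):
    """Pre-compute sums of `measure` for every subset of `dims`."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube

observations = [
    {"site": "A", "year": 2004, "band": "r", "flux": 1.5},
    {"site": "A", "year": 2005, "band": "g", "flux": 2.0},
    {"site": "B", "year": 2004, "band": "r", "flux": 0.5},
    {"site": "B", "year": 2005, "band": "r", "flux": 1.0},
]
cube = build_cube(observations, ("site", "year", "band"), "flux")

by_site = cube[("site",)]                # one perspective
by_year_band = cube[("year", "band")]    # another perspective
grand_total = cube[()][()]               # aggregation over everything
```

Once the cube is built, switching perspective is a dictionary lookup rather than a new pass over the raw data, which is the caching role described above.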

Empowering data-intensive scientists. The final piece that brings all the above advances in data management, analysis, knowledge discovery and visualisation together to empower the scientist to achieve new scientific breakthroughs is a truly smart lab notebook. Such a device would unlock access to data and would make it extremely easy to capture, organise, analyse, discover, visualise and publish new phenomena [13]. However, the outline of developments under way presented here suggests that a truly smart lab notebook will be in scientists’ hands quite some time before 2020.

Summary. The challenges of modern science require an intense interaction of the scientists with huge and complex data sets. The globally distributed nature of science means that both scientific collaborations and the data are also spread globally. As our analyses are becoming more elaborate, we need advanced techniques to manipulate, visualise and interpret our data. We expect that a new paradigm will soon emerge for the scientist–data interaction which will act as a window into the large space of specialised data sources and analysis services, making use of all the services mentioned above (discovery of data and analysis services, data administration and management tasks) in a way that is largely hidden from the scientist. Many sciences share these data management, analysis and visualisation challenges; thus we expect that a generic solution is not only possible but will have a broad impact.

Transforming Scientific Communication

Changes in scientific publishing and communication will occur in five main areas of development:

  • interactive figures and new navigation interfaces;
  • customisation and personalisation;
  • the relationship between journals and databases;
  • user participation;
  • searching and alerting services.

Data display. Provide the reader with a degree of interactivity, especially in figures. Applications of Flash®, SVG and similar technologies are not limited to figures; they should also provide new search and navigation interfaces.

Dynamic delivery. Online pages can be generated the moment they are requested, thus allowing customisation (according to a particular time or place) and personalisation (according to a particular user). Some, reading outside their main area of study, may only want a brief, superficial summary. Others may want only to scan the abstract and figures. And others still may want to read the whole paper, including accompanying supplementary information.

Deep data. Scientific communication is dominated by journals and databases, but they are poorly integrated with one another, and neither has adopted enough of the strengths of the other. A new breed of scientific publication will emerge that will cater primarily for researchers who wish to publish valuable scientific data for others to analyse. The main technical challenge here is the sheer volume of data. Just as crucial is to publish data sets in structured and machine-readable formats. Indeed, publishers also have a role in helping to promote the use of such formats, e.g. Systems Biology Markup Language (SBML; http://www.sbml.org/).

Discussion and dialogue. The meme of the moment is the ‘two-way web’ in which users are not merely passive consumers but active participants. Certain websites (e.g. eBay®, Blogger™, and Wikipedia) create environments in which users contribute content and services, and generally interact with each other, without directly involving the service provider. Another example is social bookmarking services such as Connotea, which caters specifically for the needs of scientists (http://www.connotea.org/). It seems clear that services like these will become an important way for scientists to organise, share and discover information, building and extending on-line collaborative social networks.

Digital discovery. As the volumes of scientific text and data continue to balloon, finding timely, relevant information is an increasing challenge for researchers in every discipline. Scholarly search services such as PubMed, Google™ Scholar and Astrophysics Data System certainly help a lot. (How about annotation and labelling of content?) The scientific paper as a means of communication is here to stay for the foreseeable future, despite the continuing online revolution. But it will inevitably evolve in response to scientific needs and new enabling technologies.

Computational Thinking

This report argues strongly that computer science can make a major, if not reforming, contribution to the natural sciences. Natural sciences are defined with reference to the world in which we live and the scientific methods, laws and theories used to explain what is observed. Computer science as a discipline is harder to define. For that reason, we set out in broad terms what we believe computer science is so as to anchor the subsequent discussion. Computer science is perhaps best characterised by the way in which computer scientists approach solving problems, designing systems and understanding human behaviour in the context of those systems.

Computational Thinking. To reading, writing, and arithmetic, add computational thinking to every child’s analytical ability. It includes a range of “mental tools” that reflect the breadth of our field. When faced with a problem to solve, we might first ask “How difficult would it be to solve?” and second, “What’s the best way to solve it?” Computational thinking is reformulating a seemingly difficult problem into one we know how to solve, perhaps by reduction, embedding, transformation, or simulation. Computational thinking is type checking, as the generalization of dimensional analysis. Computational thinking is choosing an appropriate representation for a problem or modelling the relevant aspects of a problem to make it tractable. Computational thinking is using abstraction and decomposition when tackling a large complex task or designing a large complex system. In short, computational thinking is taking an approach to solving problems, designing systems, and understanding human behaviour that draws on the concepts fundamental to computer science.
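
The remark that type checking generalises dimensional analysis can be illustrated with a short Python sketch. The `Quantity` class and its two-dimension (metre, second) encoding are assumptions made for this example, and the check shown here happens at run time; a static type checker performs the same kind of consistency test before the program runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    """A value tagged with exponents of (metre, second)."""
    value: float
    dims: tuple  # (length exponent, time exponent)

    def __add__(self, other):
        # Addition only makes sense for matching dimensions.
        if self.dims != other.dims:
            raise TypeError(f"cannot add {self.dims} to {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __truediv__(self, other):
        # Division subtracts the exponents, as in dimensional analysis.
        dims = tuple(a - b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value / other.value, dims)

distance = Quantity(100.0, (1, 0))   # metres
duration = Quantity(20.0, (0, 1))    # seconds
speed = distance / duration          # metres per second

try:
    distance + duration              # dimensionally meaningless
    mismatch_caught = False
except TypeError:
    mismatch_caught = True
```

Dividing a length by a time yields a quantity with dimensions (1, -1), while adding a length to a time is rejected, just as a type checker rejects adding incompatible types.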


Data Management

The growth in, availability of, and need for a vast amount of highly heterogeneous data are accelerating rapidly according to disciplinary needs and interests. These data come in different formats, resolutions, qualities and updating regimes. As a consequence, the key challenges are: (i) evolution towards common or interoperable formats and mark-up protocols; for example, as a result of efforts under way by organisations such as GBIF (www.gbif.org) or the recently created National Evolutionary Synthesis Center (www.nescent.org), we expect that by 2008 a common naming taxonomy incorporated in Web services will enable data from diverse sources around the planet to be linked by any user; (ii) the capability to treat (manage, manipulate, analyse and visualise) datasets that are already terabytes and will soon be petabytes in size, and that will grow dramatically further when data acquisition by sensors becomes common. The developments required to support such activities are discussed extensively in Part 1 of this roadmap.

Analysis and modelling of complex systems

In ecology and evolutionary biology, analytical exploration of simple deterministic models has been dominant historically. However, many-species, non-linear, non-locally interacting, spatially explicit dynamic modelling of large areas and at high resolution demands the development of new modelling techniques and associated parameter estimation and model validation [53]. Promising as these techniques are, by far the most challenging task is to integrate the heterogeneous, fast-growing amount of primary biodiversity data into a coherent theoretical (and hopefully predictive) framework. One example is the development of novel and efficient algorithms to link niche models used to predict species distributions to phylogenetic analysis in a spatially explicit context. Increased availability and use of data, and simulation power, without a comprehensive formal and theoretical scaffolding will not be enough. In addition to continuing developments in biostatistics and non-linear dynamics, computer science has the potential to provide theoretical paradigms and methods to represent formally the emerging biological knowledge. The development of formal languages oriented to represent and display complex sets of relations and describe interactions of heterogeneous sets of entities will help much of biology to become more rigorous and theoretical and less verbose and descriptive (see the subsection ‘Codification of Biology’ in Part 2 and the section ‘Global Epidemics’ below).

Computer science and computing have an essential role to play in helping understand our environment and ecosystems. The challenges run from providing more powerful hardware and the software infrastructure for new tools and methodologies for the acquisition, management and analysis of enormously complex and voluminous data, to underpinning robust new theoretical paradigms. By conquering these challenges we will be in a much better position to manage and conserve the ecosystem underpinning our life-support systems.

Part 2: The Building Blocks of a Scientific Revolution

New Software Models for New Kinds of Tools

As numerous sections of this report make clear, science will become increasingly reliant upon new kinds of tools towards 2020, and increasingly on highly novel software-based tools. As a result, new approaches to the development of software-based tools for science will be required to enable important scientific breakthroughs and enable scientists to be productive. At the same time, scientists will be required to be increasingly computationally competent in ways that benefit the overall scientific community. Both of these demands present considerable software challenges. How these challenges can be addressed requires a consideration of trends and developments in a number of areas discussed next.

Software engineering. Single processors with uniform memory systems and hierarchies are being replaced by non-uniform multi-processor (‘multi-core’) systems that defy present programming models. Scaling, already challenged by huge datasets, is now also challenged by dramatically more complex computing platforms, previously known only to supercomputing. The challenge is multiplied by the need to integrate across the Internet. The implications for science are potentially enormous. Concurrent programming on multi-core machines is likely to mean scientists become more reliant on software platforms and third-party libraries to accrue the benefits of this processing power. Validating computational results will also become more difficult, as non-determinism may become normal for complex calculations. Combining the trend towards non-uniform, parallel hardware with the computational needs of science (perhaps more than any other area) to draw on the limits of hardware, and the need to embrace ever-more complex approaches, leads to a tough challenge: to devise models enabling robust design and development of software components and frameworks and the flexible composition of components, in combination with ad hoc code, to address the needs of rapidly diversifying, converging and evolving sciences. This is a challenge for software and software engineering.

We need new programming models to cope with such new hardware topologies. This will require coordinated scientific efforts to build sharable component frameworks and matching add-in components. New software architecture methods, especially to support factoring and recomposition (to enable effective sharing among collaborating teams or even within and across disciplines) require an emphasis on components, component frameworks, component technologies, properly supported by architectural concepts.

Componentisation. Software presents itself at two levels: source and executable. Sharing source has its advantages, but as a unit of sharable software, source-code fragments are too brittle. A solid concept of software components is instead required where components are viewed as units of deployment. A deployable unit is something sitting between source code and installable, executable code, enabling the delivery of software components parameterised for a range of scenarios. Components can also encapsulate expertise: not everyone using a component would have to be capable of developing an equivalent component. In the context of science, it is most compelling to consider both arguments: efficient sharing of components and encapsulation of expertise to leverage complementary skills. Shared use of components in an unbounded number of compositions enables a systematic approach to the support for evolving scientific projects. It is critically important to understand the boundary conditions that enable composability. 'Components only compose in the context of a component framework' [40,41], that is, reference standards that establish strong conditions for integration such that components designed to meet a component framework, even if developed by mutually unaware parties, will compose.

Software services. Providing specialised software services (such as up-to-date genome catalogues) is compelling in science. It is conceivable that both government and industrial funds will help maintain such services, but the absence of a simple ‘business model’ leads to a reliance on some form of sponsorship. The actual sharing of computational resources (in a way, the most basic software service that can be offered), as envisaged by some Grid projects, seems less compelling. There are a few examples (like the SETI@Home project) that lend themselves to this approach since the underlying computation is parallel and the data sets that need to be distributed are relatively small. In most cases, however, computational resources are cheap and widely available. It would therefore seem that the sharing of services is much more interesting than that of computational resources.

Software engineering. Software engineering for science has to address three fundamental dimensions:
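
The idea that components compose only in the context of a framework can be sketched in Python as follows. The `Stage` contract, the two example components and the `compose` function are hypothetical; they stand in for a real component framework, which would also cover deployment, versioning and discovery.

```python
from typing import Iterable, Protocol

class Stage(Protocol):
    """Framework contract: every component maps records to records."""
    def process(self, records: Iterable[float]) -> Iterable[float]: ...

class Calibrate:
    """Component written against the contract by one party."""
    def __init__(self, offset: float):
        self.offset = offset

    def process(self, records):
        return [r - self.offset for r in records]

class Threshold:
    """Component written independently by another party."""
    def __init__(self, minimum: float):
        self.minimum = minimum

    def process(self, records):
        return [r for r in records if r >= self.minimum]

def compose(stages: Iterable[Stage], records):
    """The framework: it knows the contract, not the components."""
    for stage in stages:
        records = stage.process(records)
    return list(records)

cleaned = compose([Calibrate(0.5), Threshold(1.0)], [0.5, 1.5, 2.5, 3.5])
```

Neither component refers to the other; they compose only because both meet the same contract, which is the role the quoted statement assigns to the component framework.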

  • dealing with datasets that are large in size, number, and variations;
  • construction of new algorithms and structures to perform novel analyses and syntheses; and
  • sharing of assets across wide and diverse communities.

Algorithms and methods need to be developed that self-adapt and self-tune to cover the wide range of usage scenarios in science. In addition, there is a need to develop libraries of componentised assets that can be generated for a broad spectrum of platforms, preferably targeting software platforms (such as managed-code platforms) that shield software assets from a good part of the underlying hardware variety. In many cases, supporting little languages that match the domains of particular component libraries can help reduce what would be complex problems to natural programming tasks. Platforms that integrate across a broad range of languages and tools can enable such an approach.

To move beyond applications and enable broader integration, service-oriented architectures are required: applications need to be built such that they can mutually offer and draw on each other’s services. However, applications that are temporarily disconnected (partially or totally) need to continue offering some autonomous value. This is critically important to avoid unmanageable dependences across loosely coordinated research organisations. Service orientation can also enable the moving of computational activity to where datasets reside; a strategy that, for huge datasets, is often preferable over the traditional approach to move datasets to where computation takes place.
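
As a rough illustration of an application that keeps offering some autonomous value while disconnected, the Python sketch below falls back to the last known answer when a service cannot be reached. The `CatalogueClient` class, the `remote_lookup` stand-in and the sample entry are all hypothetical.

```python
class CatalogueClient:
    """Hypothetical client that keeps working when the service is down."""
    def __init__(self, fetch):
        self._fetch = fetch      # callable performing the remote lookup
        self._cache = {}         # last known good answers

    def lookup(self, key):
        try:
            value = self._fetch(key)
            self._cache[key] = value
            return value, "live"
        except ConnectionError:
            if key in self._cache:
                return self._cache[key], "cached"
            raise

state = {"online": True}

def remote_lookup(key):
    # Stand-in for a network call to a shared service.
    if not state["online"]:
        raise ConnectionError("service unreachable")
    return {"gene42": "entry v7"}[key]

client = CatalogueClient(remote_lookup)
first = client.lookup("gene42")     # service reachable
state["online"] = False             # simulate a disconnection
second = client.lookup("gene42")    # answered from the local cache
```

The caller is told whether an answer is live or cached, so stale data is never silently presented as current.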

Programming platforms

Managed platforms A significant trend in software development for the last 10 years has been the move from programming languages to programming platforms, exemplified primarily by Java™ and the Microsoft® .NET™ Framework. These ‘managed platforms’ encompass: • platform-oriented languages (e.g. Java™,Visual C#® and others); • a virtualised, high-performance, secure runtime engine; • base libraries suitable for modern programming tasks such as concurrency, distribution and networking; • key abstractions enabling extensibility and componentisation; • visualisation, graphics and media engines; • integration with related technologies such as databases and servers; • a range of related software design and analysis tools; • support for a range of interoperability formats. Managed platforms dominate commercial programming and is a trend we expect to continue to grow in science, and indeed the dominance of managed code in science is both desirable and almost certainly unavoidable. Notwithstanding this, many niche areas of software development exist where alternatives and/ or enhancements of managed platforms are deployed and used by scientists, including Python, Perl, Scheme, Fortran, C++, MATLAB® (by MathWorks; http://www.mathworks.com) and the language R (http://www.r-project.org/).

A key feature of managed platforms is that they combine multiple approaches to compilation and execution, enabling the platforms to be configured for a range of development and deployment tasks. Increasingly, managed platforms are also heterogeneous in the sense that many different kinds of programming are addressed by the platform; examples include:
• Domain-specific embedded languages for computation (e.g. utilising graphics hardware for matrix computations)
• Mathematics-oriented, scalable, script-like programming languages (e.g. F#; http://research.microsoft.com/projects/fsharp)
• Interoperable scripting languages (e.g. IronPython, JPython)
• High-quality interoperability layers for existing environments and languages (e.g. MATLAB-to-Java connectivity, also Fortran and Ada for the .NET™ platform)

A key success of managed platforms has been to bring uniformity to kinds of programming that traditionally required ad hoc solutions. For example, it is remarkable that the same programming skills can now be applied throughout the components of a heterogeneous system, e.g. Visual Basic®, C# and Java™ may all be used for client-side web programming, server-side programming, small devices and even within database processes. Hosting a computation at the right locale, such as inside a database, can yield major performance benefits (as with Microsoft’s SQL Server 2005 Stored Procedures [42] and related functionalities in many other databases). This will inevitably result in significant portions of program execution being hosted on remote machines. To match this trend, development tools are also likely to be hosted increasingly.
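The benefit of hosting a computation inside the database can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in, not SQL Server stored procedures, and the `readings` table and its values are invented for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for a remote database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)],
)

# Client-side: fetch every row, then aggregate in the application.
rows = conn.execute("SELECT site, value FROM readings").fetchall()
grouped = {}
for site, value in rows:
    grouped.setdefault(site, []).append(value)
client_side = {site: sum(vals) / len(vals) for site, vals in grouped.items()}

# Database-side: the engine computes the aggregate and returns
# only one row per site.
db_side = dict(
    conn.execute("SELECT site, AVG(value) FROM readings GROUP BY site")
)
```

With four rows the difference is invisible, but the second query returns a result whose size depends on the number of sites rather than the number of readings, which is the point made above about choosing the right locale for a computation.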

Discoverability through visualisation during software construction

Many of the tools that make up software toolkits focus on ensuring programmers can quickly discover how to use software components. Visual Studio® (http://msdn.microsoft.com/vstudio/) and Eclipse (Eclipse: a kind of universal tool platform; http://www.eclipse.org) support a number of features to allow programmers to quickly discover and navigate through programmatic interfaces. The theme of discoverability now occupies much of the design effort in software development environments. Discoverability through interactive visualisation is likely to be critical for future scientific software development.

Correctness

Computer scientists hold assumptions that place them in a difficult position when providing for the needs of scientists. One characteristic of scientific programming is that top-level code is ‘write once, run once’ (WORO). Even if a component-sharing community were established, such code will not evolve into published components. However, repeated use in multiple contexts is the foundation of ‘correctness by reputation’. Computer scientists need to accept that highly complex code may be published primarily for the purpose of analysis and validation by other scientists, rather than for direct reuse. Software published primarily for validation will require non-traditional software-engineering techniques. Recent advances in formalised mathematics show how machine-checked mathematical results can be developed in a way where the result is independently checkable through simple means, even if the process of constructing a result was extremely challenging [43].
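The idea that a result can be hard to construct yet simple to check can be illustrated by analogy. The sketch below is not formalised mathematics of the kind cited in [43]; it uses integer factorisation, where finding a factor requires a search but checking a claimed factorisation is a single multiplication. The function names are introduced here for illustration.

```python
def find_factor(n: int) -> int:
    """Expensive construction: search for the smallest non-trivial factor."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    raise ValueError("n has no non-trivial factor")


def check_certificate(n: int, p: int, q: int) -> bool:
    """Cheap validation: anyone can confirm the claim by multiplying."""
    return 1 < p < n and 1 < q < n and p * q == n


n = 1_000_003 * 1_000_033

# The producer does the hard work once and publishes (p, q) as a certificate.
p = find_factor(n)
q = n // p

# A reader validates the published result without repeating the search.
is_valid = check_certificate(n, p, q)
```

The checker is short enough to be trusted by inspection, so confidence in the result does not depend on trusting the much more elaborate code that produced it.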

Scientists as programmers

Science is at the fore among the programming disciplines for the extreme demands it places on systems and software development. While tools such as spreadsheets are increasingly used for what are essentially ‘end-user programming’ tasks (a largely neglected trend among computer scientists [44]), on the whole, existing platforms and scientific programming environments are perceived to be suboptimal for science. While scientists are positive about some aspects of platforms, e.g. libraries provided by Java™, .NET™ and/or MATLAB®, common complaints centre on the productivity of the scientist/programmer and the lack of a significant componentisation story. Licensing and legal considerations and fear of ‘platform lock-in’ are also a concern. Many scientists are clearly frustrated at the constraints placed on them by the software they have chosen to use, or by externally imposed constraints such as the need to use C++ in order to ensure their code can be added to existing applications. They are also very creative when it comes to working around limitations in interoperability. The fundamentals of a good platform for the working programmer-scientist are clear enough: performance, ease of expression, scalability, scripting, an orientation toward mathematics, science-related libraries, visualisation, tools and community. Drawing on the success of MATLAB®, Mathematica®, spreadsheet systems, etc., it is useful to think of the solution to meet the requirements of science and scientists towards 2020 as a front-end that is something akin to an ‘Office for Science’: an extensible, integrated suite of user-facing applications that remain integrated with the development environment and that help address the many human-computer interaction issues of the sciences.

A new generation of advanced software-based tools will be absolutely critical in science towards 2020. Where scientists today rely on the creation of critical software assets as a side effect of general research, leading to large numbers of weakly maintained and mutually non-integrating libraries and applications, there is a strong need to form collaborative communities that share architecture, service definitions, services, component frameworks, and components to enable the systematic development and maintenance of software assets. It will be less and less likely that even the steepest investments focusing on specific projects will be leveraged in follow-on or peer projects, unless the software-engineering foundation of such efforts is rethought and carefully nurtured. Significant challenges for governments, educators, as well as scientific communities at large are paired with challenges of technology transfer and novel development of appropriate methods and technologies in the field of software engineering. Governing bodies will need to be established and properly funded that help with curation and perhaps coordination to have any hope of progress.

To empower communities of collaborating scientists across diverse organisations, appropriate methods and tools are required. Such tools will have to draw on rich metadata, encoding facts and knowledge, organised using appropriate semantic frameworks. Construction and support of loosely-coupled, collaborative workflows will enable specialists to collaborate on very large projects. Any such collaborative sharing will have to address issues of security, privacy, and provenance. It is essential that scientists be proactive about ensuring their interests are addressed within software platforms. Platform providers must also recognise their unique responsibilities to the sciences, including a social responsibility to ensure the maximum effectiveness of the scientific community as it tackles the scientific problems of the coming century. A key challenge for scientists is to balance the tensions involved in guiding the design of tools on which they are dependent, including the need to (i) remain sufficiently distant from individual platforms in order to reap the benefits of innovation across a range of platforms; (ii) be deeply engaged in development plans for individual platforms to ensure that the requirements of a range of disciplines are catered for; (iii) be proactive in standardisation efforts and in calling for interoperable solutions; (iv) communicate the peculiarly stringent requirements that science places on software platforms. Science will benefit greatly from a strong, independent, informed, representative voice in the software industry, as will the broader communities served by science. Finally, the trends outlined above will lead to major alterations in how we perceive software and the software construction process, allowing the opposite flow of innovation. As increasingly demanding and complex solutions in science are invented, the resulting solutions are likely to be transferable to the wider software space.

New Kinds of Communities

New software-based tools will proliferate as they become increasingly essential for doing science. An inevitable consequence will be the combining by scientists of shared and differentiating software – an expression of the balance and tension between collaboration and competition that science benefits from. In the spirit of open science, differentiating software is ideally viewed as a contribution to the sciences and thus is eventually transferred to shared status. The high bar of demanding repeatability of scientific experiments equally demands the standardisation of instruments for all elements of the experimental method, parameter setting and estimation and data collection, collation and treatment – increasingly, this will not just include software, it will depend on software. This also means (i) the sharing of these ‘tools’ by and in the science community, (ii) their re-usability by others wishing to replicate or build upon the experiment(s) in which the tools were used, and (iii) their longevity, ensuring repeatability over prolonged time periods. Thus, it would seem likely that the successful bootstrap of communities that build and share effectively at the level of components, frameworks/architecture, and services would follow a pendulum model, where contributions to the community first ripen in the more closed settings of a local group, followed by an effort to release the more successful pieces to the broader community, followed by more closed enhancements, further developments, and so on. Acknowledging the pendulum process embraces the competitive side of the sciences, with a clear desire of groups to be the first to publish particular results, while still benefiting by standing on the shoulders of the community collectively creating, developing and using scientific software tools. A related challenge is educating computer scientists and software engineers to get such an approach off the ground [45].
Whatever works best for a professional software architect and engineer is not automatically what works best for the dedicated scientist who also ‘does software’, but it should provide guidance. Moreover, certain aspects of the approaches outlined in this report are hard and perhaps most effectively left to professional specialists. This is particularly true for the creation of sufficiently powerful and reasonably future-proof reference architecture and component frameworks. But this in turn requires a much deeper integration and/or relationship between computer science (and computer scientists, as well as software engineers) and the science community. Appropriate integration of professional support into the fabric of the science community is a challenge. Currently, support is typically that contributed by dedicated enthusiasts – often doctoral or post-doctoral students. However, where contributions from such efforts are meant to scale, be sharable, and to compose with efforts from others, there is a need to keep the framing straight and according to code. In conclusion, it is clear that the computer science community and the science community need to work together far more closely to successfully build usable, robust, reliable and scalable tools for doing science. This is already happening in some areas and in some countries, but the scale of the integration required is not going to happen by accident. It will require a dedicated effort by scientists, numerous government initiatives to foster and enable such an integration, and the co-involvement of commercial companies such as Wolfram, MathWorks, IBM®, Apple® and Microsoft®. It is our recommendation that government science agencies take the initiative and introduce schemes and initiatives that enable far greater co-operation and community building between all these elements.

Part 3: Towards Solving Global Challenges

The 21st Century is already starting to present some of the most important questions, challenges and opportunities in human history. Some have solutions in scientific advances (e.g. health), while others require political or economic solutions (e.g. poverty). Some require significant scientific advances in order to provide the evidence necessary to make fundamental political and economic decisions (e.g. our environment).

Earth’s Life-Support Systems

Authoritative assessments of the state of the Earth’s life support systems – broadly speaking the ‘biosphere’ (biodiversity, ecosystems and atmosphere) – show major changes in their composition, structure or functioning [46,47]. For example, currently, human activity is producing 300% more carbon dioxide per year than the Earth’s natural carbon sinks can absorb [47] and this is expected to increase significantly over the next 2-3 decades at least as growth continues in developing countries and regions such as China, India and South America. The result of this and other human activity is a potentially profound change in climate patterns and the consequent effects this could have. Moreover, we are losing a vital resource for life – the Earth’s biodiversity – at a rate probably 100 times greater than from natural loss [46], and many of the Earth’s natural resources are being grossly overexploited. For example, 90% of Brazil’s more than 1 million square kilometres of coastal forest, once one of the most diverse ecosystems on Earth, has been destroyed in the past 90 years, and fishing has massively depleted most of the world’s fish populations in just a few decades. Perhaps most worrying is the recent evidence from the Millennium Ecosystem Assessment [47] that, out of the 24 ‘Life-support services’ that nature provides and that we rely on for our continued existence, 15 are being used far faster than nature can regenerate them – and we should expect this number to rise still further.

There is a fundamentally urgent need to understand the Earth’s life support systems to the extent that we are able to model and predict the effects of continued trends of human activity on them, and the consequent effect on the ability of life to be sustained on the planet – including humans. This requires the development of extremely powerful predictive models of the complex and interacting factors that determine and influence our ecosystem and environment, and the use of these models to generate and evaluate strategies to counteract the damage the Earth is being subjected to.

Several areas of science are beginning to tackle this through integrating theory, remote sensing experiments and traditional observational studies, and computational models. Certainly, climatology and several branches of organismic biology (see below) depend increasingly upon computer science since their theories, which formerly were the sole province of mathematical biology, are becoming more and more computational in form [48].

Computational simulation and modelling of the dynamic behaviour of the world’s climate are possible today, thanks to the efforts of numerous research centres, including the Earth Simulator in Japan and the Hadley Centre for Climate Prediction and Research in the UK. Climate modelling and prediction is an obvious and critical aspect of understanding the Earth, but we should anticipate being able to understand and model other key ‘abiotic’ systems. Focusing, for example, on the ‘interior activities’ of the planet will influence our capacity to understand plate tectonics and geomagnetism, and by extension, our capacity to anticipate natural disasters such as earthquakes, volcanoes and tsunamis, perhaps as early as 2010. The other critical aspect of our understanding relies on knowledge of the ‘biotic’ Earth. Organismic biology, the study of biological entities above the cell level, spans the ecology of single populations to that of the whole biosphere, and from micro-evolutionary phenomena to palaeontology, phylogenetics and macroevolution. Across this discipline, the number and size of databases (species observations, phylogenetic trees, morphology, taxonomies, etc.) are growing exponentially, demanding corresponding development of increasingly sophisticated computational, mathematical and statistical routines for data analysis, modelling and integration. The nascent field of biodiversity informatics – the application of computational methods to the management, analysis and interpretation of primary biodiversity data – is beginning to provide tools for those purposes [49]. Clearly, the biotic and abiotic need to be modelled together: at geographical scales, climate is one of the main determinants of species’ distribution and evolution.
Climate change will affect not only species’ distributions but also ecosystem functioning [46]. These efforts allow the beginnings of the integration of large sets of biodiversity data with climatological parameters [50,51]. An important next step is to incorporate into computational models perhaps the most pervasive of effects, the influence of human activities, such as the production of global warming gases, on climate. This is already under way and will be possible to model effectively by 2010.

The increasing dependence on computing and computer science can be summarised in three key common trends discussed next.

Autonomous experimentation

Advances in remote intelligent sensor technology, coupled with advances in machine learning, are expected by 2012 to enable both (i) autonomic observation (including identification of species [52]) and (ii) autonomous experimentation (see the subsection ‘Artificial Scientists’ in Part 2), providing highly comprehensive climatic, oceanographic and ecological data among others. It will, however, also hinge on effective distributed data management (see ‘Semantics of Data’ in Part 1). Some major efforts in this direction are the NEON (National Ecological Observatory Network; http://www.neoninc.org/) and the Long-Term Ecological Research Network (LTER, with its associated informatics programme; http://www.ecoinformatics.org/) initiatives. Within 10 years, these and other projects will enable access to a net of high resolution, real time ecological data.

Data management

The growth in, availability of, and need for a vast amount of highly heterogeneous data are accelerating rapidly according to disciplinary needs and interests. These data come in different formats, resolutions, qualities and updating regimes. As a consequence, the key challenges are: (i) Evolution towards common or interoperable formats and mark-up protocols, e.g. as a result of efforts under way by organisations such as GBIF (www.gbif.org) or the recently created National Evolutionary Synthesis Center (www.nescent.org), we expect that by 2008 a common naming taxonomy incorporated in Web services will enable data from the diverse sources around the planet to be linked by any user; (ii) Capability to treat (manage, manipulate, analyse and visualise) already terabyte and soon to be petabyte datasets, which will further increase dramatically when data acquisition by sensors becomes common. The developments required to support such activities are discussed extensively in Part 1 of this roadmap.
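How a common naming taxonomy lets records from different sources be linked can be shown with a toy example. This is a sketch under stated assumptions: the synonym table, the record layout and the `canonical` and `link` helpers are all introduced here for illustration and do not reflect the GBIF or NESCent services.

```python
from typing import Dict, List

# Illustrative synonym table mapping names as recorded to one accepted name.
SYNONYMS: Dict[str, str] = {
    "Felis concolor": "Puma concolor",
    "Puma concolor": "Puma concolor",
}


def canonical(name: str) -> str:
    """Normalise whitespace and resolve a recorded name to its accepted form."""
    cleaned = " ".join(name.split())
    return SYNONYMS.get(cleaned, cleaned)


def link(*sources: List[dict]) -> Dict[str, List[dict]]:
    """Group records from any number of sources under their accepted names."""
    linked: Dict[str, List[dict]] = {}
    for source in sources:
        for record in source:
            linked.setdefault(canonical(record["name"]), []).append(record)
    return linked


# Two hypothetical sources that record the same species differently.
museum_records = [{"name": "Felis concolor", "year": 1931}]
survey_records = [{"name": "Puma  concolor ", "year": 2004}]

linked = link(museum_records, survey_records)
```

Once both sources resolve to the same key, their records can be queried together; without the shared taxonomy they would appear to describe two unrelated species.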

Analysis and modelling of complex systems

In ecology and evolutionary biology, analytical exploration of simple deterministic models has been dominant historically. However, many-species, non-linear, non-locally interacting, spatially-explicit dynamic modelling of large areas and at high resolution demands the development of new modelling techniques and associated parameter estimation and model validation [53]. Promising as these techniques are, by far the most challenging task is to integrate the heterogeneous, fast-growing amount of primary biodiversity data into a coherent theoretical (and hopefully predictive) framework. One example is the development of novel and efficient algorithms to link niche models used to predict species distributions to phylogenetic analysis in a spatially explicit context. Increased availability and use of data, and simulation power, without a comprehensive formal and theoretical scaffolding will not be enough. In addition to continuing developments in biostatistics and non-linear dynamics, computer science has the potential to provide theoretical paradigms and methods to represent formally the emerging biological knowledge. The development of formal languages oriented to represent and display complex sets of relations and describe interactions of heterogeneous sets of entities will help much of biology to become more rigorous and theoretical and less verbose and descriptive (see the subsection ‘Codification of Biology’ in Part 2 and the section ‘Global Epidemics’ below).
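For readers unfamiliar with the ‘simple deterministic models’ referred to above, the sketch below integrates a textbook two-species competition model (Lotka–Volterra form) with a forward Euler step. It is offered only as an illustration of that class of model: the parameter values are arbitrary, and nothing here is spatially explicit or fitted to data.

```python
def simulate_competition(
    n1: float, n2: float,
    r: float = 1.0,        # intrinsic growth rate (same for both species)
    k: float = 100.0,      # carrying capacity (same for both species)
    alpha: float = 0.5,    # strength of competition between the species
    dt: float = 0.01,      # Euler time step
    steps: int = 5000,
):
    """Forward-Euler integration of two-species Lotka-Volterra competition."""
    for _ in range(steps):
        # Each species grows logistically, slowed by its competitor.
        d1 = r * n1 * (1.0 - (n1 + alpha * n2) / k)
        d2 = r * n2 * (1.0 - (n2 + alpha * n1) / k)
        n1 += dt * d1
        n2 += dt * d2
    return n1, n2


final_n1, final_n2 = simulate_competition(10.0, 20.0)

# With symmetric parameters and alpha < 1, the coexistence equilibrium
# is k / (1 + alpha) for both species.
expected = 100.0 / 1.5
```

A model of this size can be analysed by hand. The many-species, spatially explicit systems described above cannot, which is why new modelling, parameter estimation and validation techniques are needed.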

Computer science and computing have an essential role to play in helping understand our environment and ecosystems. The challenges run from providing more powerful hardware and the software infrastructure for new tools and methodologies for the acquisition, management and analysis of enormously complex and voluminous data, to underpinning robust new theoretical paradigms. By conquering these challenges we will be in a much better position to manage and conserve the ecosystem underpinning our life-support systems.


Conclusions

From our analysis and findings, we draw three conclusions about science towards 2020: First, a new revolution is just beginning in science. The building blocks of this revolution are concepts, tools and theorems in computer science which are being transformed into revolutionary new conceptual and technological tools with wide-ranging applications in the sciences, especially sciences investigating complex systems, most notably the natural sciences and in particular the biological sciences. Some of us argue that this represents nothing less than the emergence of ‘new kinds’ of science. Second, that this is a starting point for fundamental advances in biology, biotechnology, medicine, and understanding the life-support systems of the Earth upon which the planet’s biota, including our own species, depends. In other words, that the scientific innovation already taking place at the intersection of computer science and other sciences ranging from molecular biology, organic, physical and artificial chemistry and neuroscience to earth sciences, ecosystems science and astrobiology has profound implications for society and for life on Earth. Additionally, such advances may also have significant economic implications. The new conceptual and technological tools we outline here have the potential to accelerate a new era of ‘science-based innovation’ and a consequent new wave of economic growth that could eclipse the last 50 years of ‘technology-based innovation’ characterising the ‘IT revolution’. Economic growth from new health, medical, energy, environmental management, computing and engineering sectors, some of which are unimaginable today, is not only entirely plausible, it is happening already. It is occurring as a consequence of the first stages of the scientific revolution now under way, a good example of which is the mapping of the human genome and the technological and economic innovation that has emerged from it.
Third, the importance and potentially profound impact of what is occurring already at the intersection of computing, computer science and the other sciences – the basics of which we summarise in this report – is such that we simply cannot afford to ignore or dismiss it. We need to act upon it. It is worth restating that our effort has not been one of ‘forecasting’ or ‘predicting’. We have simply summarised the developments actually occurring now, together with what we expect to occur as a consequence of emerging advances in computing and science, and what needs to occur in order to address the global challenges and opportunities we are already presented with as we move towards 2020. Government leaders, the science community and policy makers cannot afford to simply ‘wait and see’ or just continue ‘business as usual’. We are in important, exciting, indeed potentially extreme, times in terms of the future of our planet, our society and our economies, and extreme times call for bold measures. We therefore recommend the following immediate next steps as a call to action for the science community, for policy makers, and for government leaders.

Recommendations

Establish science and science-based innovation at the top of the political agenda

Politicians like to claim that science is important and vital to the future of the economy. They now need to back this claim with action and put science in the premier league of the political agenda. In a way we have never seen before, science really will be absolutely vital to societies, economies and our future on this planet towards 2020; and science-based innovation is likely to at least equal technology-based innovation in its contribution to economic prosperity. Making sure this happens will require governments to be bold about science and its role in the economy and society.

Urgently re-think how we educate tomorrow’s scientists

Education policy makers need urgently to re-consider what needs to be done to produce the kinds of scientists we shall need in the next decade and beyond. Tomorrow’s scientists will be amongst the most valuable assets that any nation will have. What is clear is that science will need new kinds of scientists, many of whom will need to be first-rate in more than one field of science as scientific research increasingly needs to occur across traditional scientific boundaries. As well as being required to be scientifically and mathematically literate, tomorrow’s scientists will also need to be computationally literate. Achieving this urgently requires a re-think of education policies now, not just at the undergraduate and postgraduate training level, but also at the school level since today’s children are tomorrow’s scientists. The education of today’s children – tomorrow’s scientists – is something of such importance that no government can afford to get it wrong, for failure to produce first-rate intellectual capital in a highly competitive emerging era of ‘science-based innovation’ will almost certainly carry with it serious economic consequences. Some specific recommendations are:
Children: (i) Take far bolder measures to interest children in science and then retain their interest in it and its importance for society; (ii) urgently and dramatically improve the teaching of mathematics and science in schools; (iii) make teaching of computing more than just ‘IT’ classes and how to use PowerPoint®. Make basic principles of computer science, such as abstraction and codification, a core part of the science curriculum.
Undergraduates: (i) Make computer science (again, not just ‘computing’) a key element of the science curriculum; (ii) develop into undergraduate education the concept of ‘computational thinking’ (see section on this topic in Part 1).
PhD students: (i) Training in research methods (experimental, mathematical and statistical methods) needs to be broadened to include computational methods; (ii) because of increasing interdisciplinarity, universities will need

A call to action to develop new conceptual and technological tools

Science policy makers should establish new dedicated programmes spanning science and technology to research and create the new kinds of conceptual and technological tools we outline in this report, and others we have not even begun to imagine. We believe this is absolutely vital. This will require a highly interdisciplinary focus, and the fostering of new kinds of communities (see the section ‘New Kinds of Communities’ in Part 2). The UK e-Science programme was a good example of what can be done in such programmes.

Develop innovative public private partnerships to accelerate science-based innovation

Governments, universities and businesses need to find new kinds of ways to work together. This is not new of course. In the UK, for example, the Government-commissioned Lambert Review made just such a recommendation after an extensive consultation with business and with universities, and the EU and others have been trying to foster closer academia–industry collaboration and partnerships. However, despite Governments all over Europe, as well as the USA and elsewhere, looking to industry to increase their funding of public R&D, few examples of real industry–university collaborations reveal a rosy picture of mutual benefit. Too often, one or the other ends up being dissatisfied. We believe that entirely new kinds of public-private partnerships (PPPs) are needed in order to really accelerate science and science-based innovation. Such new kinds of PPPs are likely to take several forms, and all parties will need to experiment in this area. On industry’s side, it needs to devise new models of R&D to remain competitive and in which universities are not a cheap and temporary source of ‘contract work’ but an essential, strategic partner in their ability to innovate and compete, in what has been termed an ‘Open Innovation’ model of R&D [61]. On academia’s side, it needs to really raise its sights above just ‘getting industry money’, and also look beyond just producing papers (although this is vital for the healthy advancement of knowledge). On Government’s part, science funding agencies need to be able to respond to changes in science and society and the economy quickly enough and with sufficient flexibility to enable industry to engage with universities in such new models of strategic R&D, rather than simply contractually. Establishing new kinds of joint research institutes between government, industry and the science community (see recommendation 5 above) is an interesting, and potentially highly mutually beneficial way forward.

Find better mechanisms to create value from intellectual property

Creating value from technology-based intellectual property (IP), whether created in industry or academia, has a reasonable, if sometimes chequered, track record. In science-based innovation, creating value from intellectual property has proven more difficult, with the possible exception of the pharmaceutical sector. Perhaps this has been due in part to a focus on technology rather than science by the venture capital (VC) community. We believe that there is a need for both universities and industry (and probably governments) to find new and better ways to generate value from science-based IP. New approaches and concepts such as an ‘eBay for IP’ should be given consideration.

Use our findings

The beginning of this report makes clear two things. First, that this is just a first attempt to bring together some of the complex issues at the intersection of computing and science towards 2020. It is not a definitive statement but a pointer. Second, that one of the purposes of the report is to help inform and generate discussion about the future of science in the science community. If it helps to generate debate, dissent, ideas, better thought out arguments or indeed direction, it will have served this purpose.
