
Standardizing European Statistical processes: CORA and CORE projects

Monica Scannapieco, Carlo Vaccari


Istituto Nazionale di Statistica (Istat), via C. Balbo 16, Rome, Italy. E-mail: {scannapi,vaccari}@istat.it

Abstract
The CORA (COmmon Reference Architecture) project is a research network financed by Eurostat under the 2009 Statistical Programme. The principal result of the project has been the definition of an architecture to be assumed as a reference by National Statistical Institutes (NSIs). Such an architecture is articulated along three distinct dimensions, namely technical, organizational and business. The organizational dimension of CORA is based on a survey of the commercial and legal foundations for the exchange of software between NSIs. The technical and business dimensions of the CORA architecture are structured according to a layered approach, in which lower layers offer services to upper ones. The first dimension of the technical architecture follows the GSBPM (Generic Statistical Business Process Model). The second dimension, called the construction dimension, is determined by the way services make use of one another to deliver their respective products (data and metadata). These two dimensions give rise to a grid according to which it is possible to design statistical processes in terms of services, which belong to defined layers and which are able to invoke only services of lower layers. The design of a CORA-compliant service makes it possible to wrap existing software solutions and to plug them into CORA-compliant processes. In order to empirically test the technological independence of CORA, proof-of-concept prototypes of some components of the architecture were developed on both the .Net and Java platforms. The CORA project has been followed by CORE (COmmon Reference Environment), a project started in 2011. Starting from the CORA results, CORE will extend the architectural model by defining a complete information model that takes into account process modeling, and specifically the definition of sub-processes and communication interfaces. CORE will also provide a software environment for the automated execution of statistical processes defined according to the designed information model.
Keywords: common architecture, IT standardization, software reuse, software sharing, statistical processes, statistical tools

1. - Introduction
The CORA (COmmon Reference Architecture [1]) project is an ESSnet [2] financed by Eurostat under the 2009 Statistical Programme. It started in October 2009 and ended in November 2010, involving seven National Statistical Institutes (NSIs): Istat (Italian National Institute of Statistics) as coordinator, DST (Statistics Denmark), SCB (Statistics Sweden), LS (Latvia Statistics), SFSO (Swiss Statistics), SSB (Statistics Norway) and CBS (Statistics Netherlands).

In this paper, we describe the principal result of the project, that is, the definition of an architecture to be assumed as a reference by NSIs. Such an architecture is articulated along three distinct dimensions, namely technical, business and organizational. NSIs are currently supported by different IT platforms, as well as by different business and organizational rules. However, all NSIs share similar goals and are in charge of similar surveys, so in principle each of them could be supported by a single, common architecture. The adoption of a common reference architecture leads to: (a) stronger cooperation among NSIs; (b) the sharing of IT solutions and a consequent cost reduction; (c) a quality improvement of existing solutions for both the business and the technical aspects of NSIs.

The technical and business dimensions of the CORA architecture are structured according to a layered approach, in which lower layers offer services to upper ones. This approach has the advantage of providing clear contracts between the components at each layer, which in this way have precise duties and rights. The first dimension of the technical architecture follows the GSBPM (Generic Statistical Business Process Model [3]). The second dimension, called the construction dimension, is determined by the way services make use of one another to deliver their respective products (data and metadata [4]). These two dimensions give rise to a grid according to which it is possible to design statistical processes in terms of services, which belong to defined layers and which are able to invoke only services of lower layers. Interfaces between services have been defined in order to guarantee their cooperation. The design of a CORA-compliant service makes it possible to wrap existing software solutions and to plug them into CORA-compliant processes. A basic principle of the CORA architecture is technological independence. In order to empirically test such a feature, proof-of-concept prototypes of some components of the architecture were developed on both the .Net and Java platforms.

The organizational dimension of CORA is based on a survey of the commercial and legal foundations for the exchange of software between NSIs. With the aim of defining suitable licensing models for sharing applications, we made a thorough analysis of the available licensing models, detailing characteristics, obligations of the licences and correspondences between different licences.

2. - Project Organization
The CORA project activities were divided into two managerial workpackages (WP1 - Project Management and WP5 - Project Dissemination) and three technical workpackages, namely:
- Requirement and state of the art analysis (WP2), aimed at (i) collecting requirements from different NSIs, both project participants and others, and (ii) performing a state-of-the-art analysis of the business and technical standardization solutions already adopted by NSIs. The work performed in this workpackage is detailed in Section 3;
- Technical architecture (WP3), aimed at providing the technical specification of the CORA architecture. The work performed in this workpackage is detailed in Section 4;
- Organizational architecture (WP4), with the purpose of reviewing the commercial and legal foundations for the exchange of software between NSIs. The work performed in this workpackage is detailed in Section 5.

WP2 provided inputs to both the other workpackages. More specifically, existing shared software tools were collected by WP2 and passed to WP4, which analysed the technical and commercial requirements of such tools. The collected technical and business requirements were instead inputs mainly to WP3. The relationships between the technical workpackages of CORA are shown in Figure 1.
Figure 1: Relationship among CORA technical workpackages (WP2 feeds business and technical requirements to WP3, and shared tools to WP4)

3. - Requirements and State of the Art


Workpackage 2 collected requirements and produced a state-of-the-art report with respect to the architectures and software products currently adopted by NSIs. To this purpose, a questionnaire was designed and distributed to 42 NSIs: 27 EU, 3 EFTA, 3 candidate countries, 4 potential candidate countries and 5 other MSIS [5] countries. We had a response rate of about 79%, that is, 33 of the 42 contacted NSIs, namely 21 EU, 3 EFTA (European Free Trade Association), 3 candidate countries, 3 potential candidate countries and 3 other MSIS countries. The questionnaire covered the following topics:
- statistical business process models adopted by the NSIs;
- GSBPM sub-processes for phase 5 (Process);
- enterprise architectures;
- tools, comprising shared software (software tools developed by NSIs and currently shared with other NSIs) and candidate software (software tools under development or developed by NSIs that are not currently shared but could be further developed to serve a wider number of NSIs);
- state of the art and comments.

A detailed analysis of the answers to the questionnaire can be found in deliverable 2.2 of the project [6]. In the following we report only some main findings concerning tools and the state of the art. First, an interesting outcome was the list of tools developed by NSIs that are shared or candidates to be shared. Starting from this list, a repository [7] of available tools, classified according to the GSBPM, has been realized and is currently managed by the Sharing Advisory Board [8]. As for the state-of-the-art collection, various items in the questionnaire were defined as open questions, and respondents were asked to either provide links and/or attach documents to the questionnaire itself. A total of 60 documents and 33 links were collected. The major outcomes of the state-of-the-art analysis regarded (i) the business process models currently adopted by NSIs and the available mappings to the GSBPM, and (ii) the IT strategies and principles stated in official documents of the institutes. More specifically, a strong correspondence between adopted business process models and the GSBPM was found for Austria, Finland, Netherlands, Norway, New Zealand, Sweden and the United Kingdom, while a looser one was found for Switzerland, the Czech Republic and Hungary. Concerning IT strategies and principles, a general lack of a statistical information system architecture (both IT and organizational) emerged, and the collection of principles highlighted by NSIs was provided as input to the technical workpackages of CORA.

4. - Technical Architecture
In this section, we describe the principal results concerning the CORA technical architecture. First, the principles underlying the CORA design and the main idea of the proposal are described in Section 4.1. Then, Section 4.2 presents some details of the CORA model. Finally, Section 4.3 sketches some implementations showing the feasibility of the CORA proposal.

4.1 - CORA Design
Starting from the requirements provided by WP2, a set of principles to be respected by the CORA model was stated:
- Platform independence. Respondent NSIs use various platforms (hardware, operating systems, database management systems, statistical software, etc.), hence an architecture is bound to fail if it endeavors to impose standards at a technical level. Moreover, platform independence allows a statistical process that complies with the model to be unaffected when the implementation of a service changes, without any modification of the conceptual specification of the service.
- Service orientation. The vision is that the production of statistics takes place through services calling services; services are thus the modular building blocks of the architecture. By having clear communication interfaces, services implement principles of modern software engineering such as encapsulation and modularity.
- Layered approach. According to this principle, some services are rich and are positioned at the end of the statistical process: for instance, a publishing service requires the output of all sorts of services positioned earlier in the statistical process, such as collecting data and storing information. The highest layer of the architecture is supposed to deliver services to the outer world, consisting of government, universities, companies and people. The ambition of this model is to bridge the whole range of layers, from digit to publication, by describing all layers in terms of services delivered to a higher layer, in such a way that each layer depends only on the layer immediately below it.

Starting from these principles, we designed the CORA model according to two dimensions, namely a functional dimension and a construction dimension. The GSBPM is currently being adopted as an international standard for the functional classification of activities in the statistical process. The requirements provided by WP2 reveal that over two thirds of respondent NSIs have, or are developing, an official statistical business process model, and that a significant fraction thereof (27%) have realized a mapping of their model onto the GSBPM. This led us to adopt the GSBPM as the source for the functional subdivision of layers. More specifically, the functional dimension of the CORA model is composed of the nine functionality categories of the GSBPM shown in Figure 2.

Figure 2: GSBPM sub-processes

Code F - Feature: a domain of interest documented by statistical products. Examples: Inflation; Energy production and consumption.
Code T - Time series: statistical series over time. Examples: Consumer Price Index 2000-2005; Balance Development 2000-2005.
Code S - Statistic: integrated or simple statistical product for a given time. Examples: Consumer prices in the third quarter of 2005; Energy balance of 2005.
Code P - Population: a population at a given time. Examples: Articles in the clothing industry purchased in week 41 of 2005; Electricity producers in 2005.
Code U - Unit: a statistical unit at a given time. Examples: A McGregor suit purchased at C&A's on the 6th of October 2005 at 16:45 for 256,25 euros; Electricity producer XYZ in 2005.
Code V - Variable: a statistical variable at a given time. Examples: The price in euros of a suit on the 6th of October 2005; The province in which an electricity producer is established in 2005.
Code R - Representation: a logical representation of the value of a variable. Examples: The number 256,25; The name South Holland.

Figure 3: Levels of the construction dimension
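
The construction levels above, combined with the nine GSBPM categories of the functional dimension, form the grid on which services are positioned. The following minimal Java sketch illustrates the idea; the type and constant names are our own invention, not part of the CORA deliverables:

    // Illustrative sketch of a CORA grid position; names are hypothetical.
    enum GsbpmPhase {                 // functional dimension: the nine GSBPM categories
        SPECIFY_NEEDS, DESIGN, BUILD, COLLECT, PROCESS,
        ANALYSE, DISSEMINATE, ARCHIVE, EVALUATE
    }

    enum ConstructionLevel {          // construction dimension (Figure 3), bottom-up
        REPRESENTATION, VARIABLE, UNIT, POPULATION, STATISTIC, TIME_SERIES, FEATURE
    }

    record GridPosition(GsbpmPhase phase, ConstructionLevel level) {
        // the layering rule: a service may only invoke services of lower layers
        boolean mayInvoke(GridPosition callee) {
            return callee.level().ordinal() < level().ordinal();
        }
    }

    public class GridDemo {
        public static void main(String[] args) {
            GridPosition estimator = new GridPosition(GsbpmPhase.ANALYSE, ConstructionLevel.STATISTIC);
            GridPosition collector = new GridPosition(GsbpmPhase.COLLECT, ConstructionLevel.UNIT);
            System.out.println(estimator.mayInvoke(collector));   // true: unit level lies below statistic level
        }
    }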

The second dimension in the layered architecture model is determined by the way services make use of one another to deliver their respective products. These products always consist of data and metadata, and data and metadata are always constructed by making use of data and metadata of a lower layer; this is why the dimension is called the construction dimension. The levels of the dimension are shown in Figure 3.

4.2 - CORA Model
This section presents the relationships between the concepts of the layered architecture model and illustrates them with a UML diagram (see Figure 4).
Figure 4: Class diagram of the CORA model (classes: Layer, Element, Service, Constructor, Construct, Object; constraint: output.belongs_to.level = input.belongs_to.level + 1)

The central piece of the model is an Element, which can be of two types, namely a Construct and an Object. Most end products of a statistical office consist of values that cannot be measured directly, such as average incomes and inflation rates, because they describe things that cannot be observed, such as populations and price changes. These values have to be computed from other values, which are available for measuring because they describe things that can be observed, such as people and purchased articles. Whatever cannot be observed has to be constructed: it has to be defined intensionally, by referring to values that are available (whether they can be observed or have to be computed as well), and has to be supplied with a prescript on how to obtain its values. In this model we call observable things Objects and non-observable things Constructs.

A Constructor is a combination of data and metadata specified within the framework of a statistical activity in order to produce a Construct. A Constructor belongs to the same level as its output. Prescript is a general term that refers to the document formally specifying (prescribing) the behaviour of the constructor. Every Construct is produced from Elements. These Elements can be Objects (e.g. the population level construct "Population of North Holland in 2007" is constructed from unit level objects responding to the definition "Person living in North Holland in 2007"), but they can also be Constructs (e.g. the statistic level construct "Consumer Price Index of 2007" is constructed from the population level construct "Articles purchased in 2007"). Input elements belong, by definition, to the layer immediately below the Construct they are used to produce. For instance, a Constructor producing population information (such as an average income) is on the Population level and makes use of Elements of the Unit level (Persons), which can supply the incomes from which to compute the average income. The resulting construct is the output of the Constructor. The Constructor carries out its work by performing its Prescript.

All data and metadata produced by statistical offices are created, according to this model, by Constructors. The Prescripts of a Constructor can be carried out by software tools, but also by human work. The instance (human or machine) doing the work of a constructor is called a Service. The Service is the point of contact for the activation of a Constructor; it takes care of checking and transferring parameters to the Constructor. The expected parameters are defined by the input elements, while the actual values are supplied by the user of the service.

There are two functional domains in which services have no constructors: Storage and Presentation. This is because these services are semantically neutral: they do not affect the meaning of data. They only effect formal transformations, which are reversible. Thus they are not responsible for constructing data, and need no constructors.

The question may arise why we need both the concept of a service and that of a constructor: could the user not call the constructor directly? The separation is due to the requirement of implementation independence: a constructor tells what has to be done, and its prescripts tell how to do it in statistical terms, regardless of whether they are carried out by man or machine. There could be two different services (one manual and one automated) to carry out the prescripts of the same constructor. Another reason is the need for a unified description of services: for the sake of clarity and simplicity, we want to describe all services in the same way, whether they construct data or not.

A constructor contains prescripts guiding the production of its construct. These prescripts are made of metadata describing the data used as input, the resulting construct and the activities to deploy in order to produce it. As mentioned, a constructor belongs to the same layer as its product, which entails that metadata belong to the same level as the data they describe or for whose production they supply the prescripts. In other words, the same architecture model can be used to describe both data and metadata. All metadata are managed and used within the service they define. A service knows its own metadata and can supply its own model on demand. For example, a service that supplies data of the construct Person can also supply the model of this construct, because it is part of its construction prescript. This approach enforces the architectural guideline that there can be no data without metadata. Data and metadata reside side by side in the layer of the construct or object they belong to. Every metadata element belongs to a specific layer. This introduces an ordering that promotes optimal reuse.
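
As a rough illustration of these relationships, the following Java sketch is our own simplification (class and method names are not taken from the CORA deliverables): it encodes Elements, Objects, Constructs, Constructors and Services, together with the constraint that input elements sit one level below the output construct.

    import java.util.List;

    // Our illustrative encoding of the CORA model; "Object" is renamed
    // StatObject to avoid clashing with java.lang.Object.
    enum Level { REPRESENTATION, VARIABLE, UNIT, POPULATION, STATISTIC, TIME_SERIES, FEATURE }

    interface Element { Level level(); }                                 // Construct or Object
    record StatObject(String name, Level level) implements Element { }   // observable things
    record Construct(String name, Level level) implements Element { }    // constructed things

    // A Constructor produces a Construct from input Elements by performing
    // its Prescript; it belongs to the same level as its output.
    class Constructor {
        final String prescript;   // formal specification of the behaviour
        final Level outputLevel;
        Constructor(String prescript, Level outputLevel) {
            this.prescript = prescript;
            this.outputLevel = outputLevel;
        }
        Construct construct(String name, List<Element> inputs) {
            for (Element e : inputs)
                if (e.level().ordinal() != outputLevel.ordinal() - 1)   // layering constraint
                    throw new IllegalArgumentException("inputs must be one level below the output");
            return new Construct(name, outputLevel);
        }
    }

    // A Service is the point of contact that activates a Constructor,
    // checking and transferring the parameters supplied by the user.
    class Service {
        final Constructor constructor;
        Service(Constructor constructor) { this.constructor = constructor; }
        Construct invoke(String name, List<Element> inputs) { return constructor.construct(name, inputs); }
    }

    public class CoraModelSketch {
        public static void main(String[] args) {
            Constructor avgIncome = new Constructor("average of unit incomes", Level.POPULATION);
            Construct result = new Service(avgIncome).invoke("Average income 2007",
                List.of(new StatObject("Person A", Level.UNIT), new StatObject("Person B", Level.UNIT)));
            System.out.println(result);
        }
    }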

4.3 - Implementation Prototypes
Simple prototype systems have been built by Istat and by Statistics Netherlands. The two prototypes were implemented as proofs of concept of the proposed architectural model. The chosen platforms were Java and .Net, in order to demonstrate the platform-independence principle stated for CORA. Istat implemented an API able to transform the inputs and outputs of an R script from and to the CORA model. The API is entirely written in Java; mappings and input/output files are written as XML files. The API strongly exploits Java reflection and dynamic programming, so it is not tied to a specific application domain and can be used without recompilation. At present, the implemented mapping types include CSV (for reading and writing text files with separators) and SQL (for reading from and writing to relational databases). The programmer can extend the API by providing custom mappings for any data format she needs to deal with. Statistics Netherlands implemented generic service interfaces built around two tools, namely Digros and R. The goal was to wrap these two tools with the same generic service interfaces and then let them exchange data using those interfaces. Two services were developed within the scope of CORA, wrapping respectively Digros and R. The Digros wrapping service was built upon the Microsoft .NET framework (with the C# programming language), while the R wrapping service was built within the R programming language and makes use of existing R libraries. The two services can be invoked sequentially: the output of a call to one service (data export) is written to an SDMX file that can be taken as input by a subsequent call (data import).
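
To give a flavour of the extension mechanism, a custom mapping in the spirit of the Istat API might look like the sketch below; the Mapping interface and its method signatures are hypothetical stand-ins, since the actual API is not reproduced here.

    import java.io.*;
    import java.util.*;

    // Hypothetical mapping contract: translate between an external data
    // format and simple field-name/value records of the CORA model.
    interface Mapping {
        List<Map<String, String>> read(File source) throws IOException;
        void write(File target, List<Map<String, String>> records) throws IOException;
    }

    // Example custom mapping for a semicolon-separated text format.
    class SemicolonCsvMapping implements Mapping {
        public List<Map<String, String>> read(File source) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(source))) {
                String headerLine = in.readLine();
                if (headerLine == null) return List.of();
                String[] header = headerLine.split(";");
                List<Map<String, String>> records = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    String[] values = line.split(";");
                    Map<String, String> row = new LinkedHashMap<>();
                    for (int i = 0; i < header.length && i < values.length; i++)
                        row.put(header[i], values[i]);
                    records.add(row);
                }
                return records;
            }
        }
        public void write(File target, List<Map<String, String>> records) throws IOException {
            try (PrintWriter out = new PrintWriter(new FileWriter(target))) {
                if (records.isEmpty()) return;
                out.println(String.join(";", records.get(0).keySet()));
                for (Map<String, String> row : records)
                    out.println(String.join(";", row.values()));
            }
        }
    }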

5. - Organizational Architecture
In the CORA workpackage 4 we started to define the commercial and legal foundations for the exchange of software between statistical offices in Europe. Since these offices are not primarily intended to market their software packages (although some of them do so), the Open Source way is an obvious and viable alternative for such exchange activities. In this workpackage, business models for software exchange were also analysed, ranging from barter to co-development and to various kinds of freeware and Open Source development and support.

It was very important to study existing licensing models, considering that in many NSIs (and often in Public Administration bodies in general) it is very unusual to find this kind of knowledge. Two major groups of open licences are currently in use: the GNU-type licences and the Apache/BSD-type licences. The licence that originated within the GNU (a recursive acronym for GNU's Not Unix) project, founded by Richard M. Stallman at MIT, is the GNU General Public License (GPL), created in 1988-1989 [9]. It was described by Stallman with these words: "The General Public License is a copying license which basically says that you have the freedoms we want you to have and that you can't take these freedoms away from anyone else". The most important characteristic of the GPL is 'copyleft', which aims to preserve the freedom and openness of the software itself. The GPL gives the end-user significant permissions, such as the permission to redistribute, reverse engineer, or otherwise modify the software, but the end-user must redistribute any modification under the same licence (a so-called viral licence).

Apache/BSD licences are a family of permissive free software licences, originally used for the Berkeley Software Distribution (BSD) Unix developed at the University of California, Berkeley, and now used by the Apache Foundation, developer of the world's most widely used Web server [10]. These licences have fewer restrictions than the GPL, putting works licensed under them relatively closer to the public domain, i.e. content that is not owned or controlled by anyone ("public property"). These licences grant the end-user permission to do anything they wish with the source code in question, including the right to take the code and use it as part of closed-source or proprietary software.

In recent years two different definitions of open software licences have confronted each other: the Free Software Definition, sponsored by the Free Software Foundation (FSF), founded by Stallman, and the Open Source Definition, established by the OSI [11]. These definitions were very useful to clarify the characteristics of licences; however, CORA being a European project, we needed licences fitting the legal environments of Europe and of the particular national legislations. We therefore evaluated the European Union Public Licence (EUPL) [12], a free software licence created and approved in 2007-2009 by the European Commission. The EUPL main goal is to be consistent with the copyright law of the 27 Member States of the European Union, while retaining compatibility with popular open-source software licences such as the GPL. Another essential feature is the availability of the licence in the 22 official languages of the European Union, given that national laws often require legal texts to be written in national languages. For these two main reasons the EUPL is surely the best choice for free software projects developed inside European public administrations.

Besides licensing models, in the project we analysed questions concerning further development and support, including training, for free software, describing different ways to use and improve this kind of software for (statistics) administrations throughout Europe and beyond. So the barter model (exchanging goods and services on a non-monetary basis), co-development between partners and the freeware model were discussed, listing moreover the different kinds of support (locally contracted, in-house, etc.) that can be given in free software development.

6. - From CORA to CORE


The CORE [13] (COmmon Reference Environment) ESSnet aims to continue the work of CORA, which finished at the end of October 2010. CORA produced the definition of an architectural model together with proof-of-concept software prototypes; starting from these results, CORE will extend the architectural model by defining a complete information model that takes into account process modeling, and specifically the definition of sub-processes and communication interfaces. CORE will then provide a software environment for the automated execution of statistical processes defined according to the designed information model. CORE goes in the direction of fostering the sharing of tools among NSIs: indeed, a tool developed by a specific NSI can be wrapped according to the extended CORA model (e-CORA) and thus easily integrated within a statistical process of another NSI. Moreover, having a single environment for the execution of entire statistical processes provides a high level of automation in process execution.

The ESSnet CORE, started in January 2011, has a duration of 13 months and involves Istat (Italian National Institute of Statistics) as coordinator, with CBS (Statistics Netherlands), SSB (Statistics Norway), INE (Statistics Portugal), SCB (Statistics Sweden) and INSEE (France) as participants.

The first objective of CORE is to extend the CORA model by defining the new information model e-CORA. Specifically, we will take the CORA model (including the GSBPM) as a starting point and turn it into a fully elaborated information model that covers business concepts, statistical goals and methods, as well as operational logic and implementation aspects. The role SDMX [14] can play in this model must also be taken into consideration. Specifically, a possible mapping between SDMX and the e-CORA model will be investigated in order to take fully into account the results and tools already available for SDMX.

The second goal is related to the analysis of a list of tools and to the study of the effort necessary to integrate such tools into the e-CORA model. In particular, starting from the inventory of tools that was prepared within CORA, we will select some tools to be used within statistical processes executed in the CORE framework. We will then perform a specific evaluation of the feasibility and of the cost of integrating such tools.

The third objective of the project is related to (i) the definition of a way of exchanging data between tools designed inside GSBPM sub-processes, and (ii) the development of components wrapping such tools in order to integrate them. CORA has defined a technical architecture for tools that has the following qualities:
1. Tools are encapsulated into statistical services that are well defined in terms of GSBPM process and data aggregation level. These services can be understood by statisticians;
2. The communication is the same for all tools, regardless of their technical implementation. This means that a tool implementing the CORA protocol can be integrated into the infrastructure of any NSI that supports CORA, regardless of other tools, underlying operating systems and hardware.

For these qualities to be realized by actual technical components (see the sketch after this list), the following topics have to be addressed by CORE:
- describe exactly what data have to be exchanged between tools (the data themselves as well as metadata and process data); for this, existing project results on these kinds of data will be examined and reused;
- describe exactly how the data will be communicated (data format, communication protocols, necessary support for operating systems, DBMSs, middleware and other elements of the infrastructure);
- provide a way to define the processes to be executed in the framework;
- provide a way to specify the services implementing the processes to be executed in the framework.

The development of services will be performed using two different technological solutions, Java and .Net. In this way we will underline the technological independence of CORE-compliant services, by choosing both an open source and a proprietary platform.
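
The following Java sketch illustrates the second quality, uniform communication regardless of implementation. It is our own illustration with hypothetical names (the actual CORE interfaces were still to be defined at the time of writing): two tools are hidden behind the same interface and chained through a file handoff, mirroring the SDMX-based exchange used in the CORA prototypes.

    import java.io.IOException;
    import java.nio.file.*;

    // Uniform, tool-agnostic service contract (names are hypothetical).
    interface StatisticalService {
        Path execute(Path input) throws IOException;   // file in, file out (e.g., SDMX)
    }

    class DigrosWrapperService implements StatisticalService {
        public Path execute(Path input) throws IOException {
            Path output = Files.createTempFile("digros-step", ".sdmx");
            // a real wrapper would invoke the Digros tool on 'input' here
            Files.copy(input, output, StandardCopyOption.REPLACE_EXISTING);   // placeholder
            return output;
        }
    }

    class RWrapperService implements StatisticalService {
        public Path execute(Path input) throws IOException {
            Path output = Files.createTempFile("r-step", ".sdmx");
            // a real wrapper would run an R script on 'input' here
            Files.copy(input, output, StandardCopyOption.REPLACE_EXISTING);   // placeholder
            return output;
        }
    }

    public class PipelineDemo {
        public static void main(String[] args) throws IOException {
            Path raw = Files.createTempFile("raw", ".sdmx");
            // chaining: the export of one service is the import of the next
            Path result = new RWrapperService().execute(new DigrosWrapperService().execute(raw));
            System.out.println("pipeline output: " + result);
        }
    }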

The fourth CORE objective is the analysis of possible middleware solutions to implement CORE-compliant workflows. CORA describes what statistical processes look like, but only covers the IT aspects of how to translate individual process steps into technical services and service cores. The IT aspects of how to translate the interaction between process steps, which concerns process execution, workflow management, exception handling, etc., have not yet been covered by CORA. The technical implementation of this service interaction, which is essentially the embodiment of the statistical process according to CORA, will give meaning to the statistical services that will be the result of CORE. This implementation will enable NSIs to integrate these CORE services into working processes. This requires the following topics to be addressed by CORE:
- investigate the requirements that CORA (as the business context) and the services defined by CORE (as the technical components) impose on the integration middleware;
- elaborate on a few example scenarios, based on concrete middleware solutions, of how different NSIs could implement or use the required middleware and connect their services.

7. - Conclusions
In this paper we have summarized the results of the CORA project. Such results have been jointly produced by the CORA team and are fully detailed in the CORA deliverables [6]. CORA has produced the definition of an architectural model to be used as a reference by NSIs. Proof-of-concept software prototypes have also shown the viability of the proposal. A new ESSnet, CORE (COmmon Reference Environment), aims to continue the work of CORA. Starting from the CORA results, the main objective of CORE is to provide a software environment for the automated execution of statistical processes. Defining a common environment can foster collaboration among institutions and stimulate the sharing of software tools.

Acknowledgement
In this paper we have summarized the results of the CORA and CORE projects, both funded by Eurostat. Such results have been jointly produced by the ESSnet team [15]. The CORA results are fully detailed in the CORA deliverables [6].

References
[1] CORA project, http://cora.forge.osor.eu/
[2] Collaborative European Statistical System Research Network, http://epp.eurostat.ec.europa.eu/portal/page/portal/essnet/introduction
[3] UNECE description of the GSBPM in various languages and other resources, http://www1.unece.org/stat/platform/display/metis/The+Generic+Statistical+Business+Process+Model
[4] In ESSnet projects the term metadata is used in the sense of statistical metadata according to METIS group specifications, laid out in the Common Metadata Framework, http://www1.unece.org/stat/platform/display/metis/The+Common+Metadata+Framework
[5] Management of Statistical Information Systems, http://www.unece.org/stats/archive/04.01a.e.htm
[6] CORA deliverables (under Creative Commons licence), http://cora.forge.osor.eu/Deliverables.htm
[7] MSIS Software Inventory, http://www1.unece.org/stat/platform/display/msis/Software+Inventory
[8] Sharing Advisory Board, http://www1.unece.org/stat/platform/display/msis/About+the+SAB
[9] GNU General Public License, http://www.gnu.org/licenses/gpl.html
[10] Netcraft Web Server Survey, http://news.netcraft.com/archives/category/web-server-survey/
[11] Open Source Initiative, http://www.opensource.org/
[12] European Union Public Licence, http://www.osor.eu/eupl
[13] CORE ESSnet, http://www.essnet-portal.eu/project-information/core
[14] Statistical Data and Metadata Exchange, http://sdmx.org/
[15] CORA team, http://cora.forge.osor.eu/References.htm
