
Fostering Analytics on Learning Analytics Research: the LAK Dataset

Davide Taibi
Institute for Educational Technologies National Research Council of Italy Via Ugo La Malfa 153, Palermo Italy

Stefan Dietze
L3S Research Center Appelstr. 9a 30167 Hannover Germany

davide.taibi@itd.cnr.it

dietze@l3s.de
allow computational analyses of research publications in the emerging Learning Analytics field, we have published a comprehensive set of scientific papers on learning analytics and educational data mining in a machine-readable format. This collection can serve as input for scientometric studies and for investigations into the evolution of the discipline or its correlations with other fields. The LAK dataset provides access to documents already available online in unstructured form, but also to research works that were not publicly accessible before. The collection includes the proceedings of the International Educational Data Mining Society⁴ as well as the special issue on Learning Analytics of the Journal of Educational Technology and Society⁵. In both cases the full text has already been freely available, but only in unstructured format. Content previously accessible to subscribers only includes the Proceedings of the ACM International Conference on Learning Analytics and Knowledge, published by ACM. In the framework of our initiative, ACM is making the Learning Analytics content of its ACM Digital Library freely available, solely for research purposes. The following table describes in detail the set of papers that have been collected, processed and exposed in structured form in the Learning Analytics and Knowledge (LAK) dataset:

ABSTRACT
This paper describes the Learning Analytics and Knowledge (LAK) Dataset, an unprecedented collection of structured data created from a set of key research publications in the emerging field of learning analytics. The unstructured publications have been processed and exposed in a variety of formats, most notably according to Linked Data principles, in order to provide simplified access for researchers and practitioners. The aim of this dataset is to provide the opportunity to conduct investigations, for instance, about the evolution of the research field over time, correlations with other disciplines or to provide compelling applications which take advantage of the dataset in an innovative manner. In this paper, we describe the dataset, the design choices and rationale and provide an outlook on future investigations.

Categories and Subject Descriptors


H.3.4 [Semantic Web], I.2.4 [Ontologies]

General Terms
Algorithms, Documentation, Design, Experimentation, Standardization.

Keywords
Learning Analytics, Educational Data Mining, Data, Linked Data, Semantic Web

Table 1: Papers included in the LAK dataset

Publication | # of papers
Proceedings of the ACM International Conference on Learning Analytics and Knowledge (LAK) (2011-12) | 66
Educational Technology & Society (open access journal), Special Issue on Learning & Knowledge Analytics, edited by George Siemens & Dragan Gašević, 2012, 15(3), pp. 1-163 | 10
Proceedings of the International Conference on Educational Data Mining (2008-12) | 239
Journal of Educational Data Mining (2008-12) | 16

1. INTRODUCTION
As part of an international team of research practitioners consisting of the Society for Learning Analytics Research (SoLAR)¹, ACM², the LinkedUp project³ and the Institute for Educational Technologies of the National Research Council of Italy (CNR-ITD), we have released an unprecedented resource for the Learning Analytics and Educational Data Mining communities. In order to
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

1 www.solaresearch.org/
2 http://acm.org/
3 http://linkedup-project.eu/
4 http://www.educationaldatamining.org/proceedings
5 http://www.ifets.info/issues.php?id=56

Figure 1: Knowledge extraction process

2. FROM SCHOLARLY PAPER TO STRUCTURED DATA

2.1 The extraction process


In order to process and analyze the set of unstructured journal and conference papers, the data was transformed into structured data. Each conference proceeding is available on the Web in PDF format, but each collection has its own structure. Even though the most widely used format is the ACM template⁶, papers do not always comply with it entirely, which called for specifically adapted extraction mechanisms. The overall knowledge extraction process is composed of three main steps:
1. Transforming PDF documents to plain textual representation.
2. Cleaning up and consolidation of the textual information.
3. Extracting structured data from text.

For each paper the following information is collected: title, authors, keywords, abstract, full text and its relationship with the type of publication or event (journal or conference proceedings). It is important to note that, besides the common metadata for the learning analytics papers such as title, abstract, authors and affiliation, the full text of the papers is also stored in the dataset. At this stage the full text has been stored without separating it into paragraphs and sections. However, the processing performed in step 2 has also identified the titles and paragraphs of sections and subsections, thus providing the basis for analyzing the full text at finer granularity in future versions of the LAK dataset. The references of each paper are also extracted, but are not made available in this version of the LAK dataset.

2.2 The schema

In the first step, the PDF file containing the proceedings of a conference or the papers of a journal is split up in order to obtain one document per paper. Each PDF file is then processed with the pdf2text tool to obtain a textual representation of each paper. In the second step, the text files are transformed into a partially structured format that can be processed automatically. In particular, tables and figures are removed from the paper at this step, while their captions are kept because they can be useful for text mining; footnotes are also removed from the text, and bulleted or numbered lists are brought into a homogeneous format. In the third step, the text files are processed in order to extract the most important sections of the document. The name, affiliation and country of each author are represented using the FOAF ontology.
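The clean-up rules of the second step can be illustrated with a minimal Python sketch. It is not the extraction code used for the dataset: the three regular expressions (what counts as a caption, a footnote line or a list marker) are assumptions chosen for illustration, and removing the bodies of tables and figures is left out because the text does not say how they were detected.

```python
import re

# Hypothetical patterns; the actual rules used for the LAK dataset are not documented.
CAPTION = re.compile(r"^(?:Figure|Table)\s+\d+\s*[:.]")          # e.g. "Figure 1: ..."
FOOTNOTE = re.compile(r"^\d{1,2}\s+(?:https?://|www\.)\S+$")     # e.g. "6 http://..."
BULLET = re.compile(r"^\s*(?:[•▪●*\-]|\d+\))\s+")                # e.g. "• item", "2) item"


def clean_lines(lines):
    """Keep captions, drop URL footnotes, and normalise list markers to '- '."""
    cleaned = []
    for line in lines:
        text = line.strip()
        if not text:
            continue                      # skip empty lines
        if CAPTION.match(text):
            cleaned.append(text)          # captions are kept for text mining
        elif FOOTNOTE.match(text):
            continue                      # footnotes are removed
        elif BULLET.match(text):
            cleaned.append("- " + BULLET.sub("", text, count=1))
        else:
            cleaned.append(text)
    return cleaned
```

Calling `clean_lines` on the lines of one paper yields a list of lines in which every list item starts with the same marker, which is one possible reading of "homogeneous format".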

The schema used to describe the papers in the dataset is based on two established schemas: the Semantic Web Conference (SWC) ontology⁷ (already used to describe metadata about publications from the Semantic Web conferences and related events⁸) and the Linked Education schema⁹. The Linked Education schema has been developed to represent and catalog both educational and education-related datasets, the latter being datasets not specifically created for education but usable in an educational context. The schema has been used to annotate datasets and resources as part of an integrated dataset¹⁰, which contains educationally relevant resources such as LinkedUniversities [1] and the mEducator Educational Resources [4], with their Open Educational Resources and materials explicitly related to education, as well as implicitly educationally relevant datasets such as BBC Programmes [3], ACM Library Metadata¹¹ and Europeana [2]. The main entities collected in the LAK dataset are paper authors, institutions and papers, related to the
6 http://www.acm.org/sigs/publications/proceedings-templates
7 http://data.semanticweb.org/ns/swc/ontology
8 http://data.semanticweb.org/
9 http://data.linkededucation.org/ns/linked-education.rdf
10 http://linkedup.l3s.uni-hannover.de:8880/openrdfsesame/repositories/linked-learning-selection?query
11 http://acm.rkbexplorer.com/

learning analytics area. Authors and institutions have been represented using the classes Person and Organization of the FOAF ontology, respectively, while papers are represented using the class InProceedings of the SWRC ontology. The LAK Dataset, at the time of writing, includes 779 authors, connected to 295 institutions, and 315 posters, abstracts, short papers and full papers.
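To make the class choices concrete, the following sketch serialises one paper with its authors as N-Triples lines. Only the three classes (FOAF Person, FOAF Organization, SWRC InProceedings) come from the description above. The linking properties (`swrc:title`, `foaf:name`, `foaf:made`, `swrc:affiliation`) and all `example.org` URIs are assumptions for illustration and may differ from what the LAK dataset actually uses.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
SWRC = "http://swrc.ontoware.org/ontology#"


def iri(value):
    """Wrap a URI in angle brackets as required by N-Triples."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def describe_paper(paper_uri, title, authors):
    """authors: list of (author_uri, name, org_uri, org_name) tuples."""
    triples = [
        (iri(paper_uri), iri(RDF_TYPE), iri(SWRC + "InProceedings")),
        (iri(paper_uri), iri(SWRC + "title"), literal(title)),        # assumed property
    ]
    for author_uri, name, org_uri, org_name in authors:
        triples += [
            (iri(author_uri), iri(RDF_TYPE), iri(FOAF + "Person")),
            (iri(author_uri), iri(FOAF + "name"), literal(name)),
            (iri(author_uri), iri(FOAF + "made"), iri(paper_uri)),     # assumed property
            (iri(author_uri), iri(SWRC + "affiliation"), iri(org_uri)),  # assumed property
            (iri(org_uri), iri(RDF_TYPE), iri(FOAF + "Organization")),
            (iri(org_uri), iri(FOAF + "name"), literal(org_name)),
        ]
    return [f"{s} {p} {o} ." for s, p, o in triples]
```

Each returned string is one statement, so the result can be written line by line to a `.nt` file and loaded into any RDF store for experimentation.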

2.3 Access Methods


In order to support different access methods for the data, the resources of the LAK dataset have been published in different formats:
- A dump file in zipped RDF/XML format can be downloaded directly from the SoLAR research web page.
- A version of the dataset in a format that can be processed with the R statistics software¹² has been provided.
- A Linked Data endpoint with a public SPARQL endpoint¹³ has been developed in order to provide access to structured RDF metadata according to LOD principles.

Conference, Leuven, Belgium (April 2013)¹⁸. The challenge revolves around the overall question of what insights can be gained from analytics on the LAK corpus about the general discipline of Learning Analytics and its connection to other fields. How can we make sense of this emerging field's historical roots, current state, and future trends, based on how its members report and debate their research? Challenge submissions should exploit the LAK Dataset by covering one or more of the following, non-exclusive list of topics:
- Analysis & assessment of the emerging LAK community in terms of topics, people, citations or connections with other fields
- Innovative applications to explore, navigate and visualise the dataset (and/or its correlation with other datasets)
- Usage of the dataset as part of recommender systems

4. CONCLUSIONS
The LAK Dataset is a first starting point for analyzing learning analytics as an emerging research discipline, including its definition and evolution. As the body of research in learning analytics and related fields grows, we intend to expand the dataset by adding new research publications. In addition, while the dataset currently contains plain metadata and the full text of the research publications, we plan to extract and add further data about the entities and topics they contain, in order to provide simple means for assessing, exploring and navigating the data.

The following SPARQL query¹⁴ on the LAK dataset can be used to extract the full text of all 2011 papers (LAK 2011 and EDM 2011 conferences) in .srx format (an XML file which can be opened in any text editor):
PREFIX led: <http://data.linkededucation.org/ns/linked-education.rdf#>
PREFIX swrc: <http://swrc.ontoware.org/ontology#>
SELECT ?paper ?fulltext
WHERE {
  ?paper led:body ?fulltext .
  ?paper swrc:year ?year .
  FILTER (?year = 2011)
}
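As a hedged illustration of how such a query could be sent and its result read, the sketch below builds a request URL of the form given in the footnotes (`...?query=[your sparql query]`) and parses a result document in the SPARQL Query Results XML format, which is what .srx files contain. It performs no network access, and the sample result document and its `example.org` URI are invented for the example. The numeric comparison `?year = 2011` is copied from the query text; the URL in footnote 14 compares against the string "2011" instead, so which form matches the stored data is not established here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ENDPOINT = "http://data.linkededucation.org/openrdf-sesame/repositories/lakconference"

QUERY = """PREFIX led: <http://data.linkededucation.org/ns/linked-education.rdf#>
PREFIX swrc: <http://swrc.ontoware.org/ontology#>
SELECT ?paper ?fulltext WHERE {
  ?paper led:body ?fulltext .
  ?paper swrc:year ?year .
  FILTER (?year = 2011)
}"""

# Namespace of the W3C SPARQL Query Results XML format (.srx).
SRX_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Invented sample result, used only to exercise the parser.
SAMPLE_SRX = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="paper"/><variable name="fulltext"/></head>
  <results>
    <result>
      <binding name="paper"><uri>http://example.org/paper/1</uri></binding>
      <binding name="fulltext"><literal>Full text of the paper</literal></binding>
    </result>
  </results>
</sparql>"""


def build_query_url(endpoint, query):
    """Percent-encode the query and append it as the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def parse_srx(xml_text):
    """Return one dict per result row, mapping variable name to its value."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", SRX_NS):
        row = {}
        for binding in result.findall("sr:binding", SRX_NS):
            value = binding[0]            # <uri>, <literal> or <bnode>
            row[binding.get("name")] = value.text or ""
        rows.append(row)
    return rows
```

A caller would fetch `build_query_url(ENDPOINT, QUERY)` with any HTTP client and pass the response body to `parse_srx`.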

5. ACKNOWLEDGMENTS
This work is partly funded by the European Union under FP7 Grant Agreement No 317620 (LinkedUp).

On the SoLAR website, some useful example queries¹⁶ for the SPARQL endpoint¹⁵ of the LAK dataset have been published. The example queries allow users, for instance, to retrieve:
- the papers co-authored by two selected authors;
- all papers published in both the EDM and LAK conferences by the authors affiliated with an institution.

6. REFERENCES
[1] Fernandez, M., d'Aquin, M., and Motta, E. 2011. Linking Data Across Universities: An Integrated Video Lectures Dataset. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011), 23-27 Oct 2011, Bonn, Germany.
[2] Haslhofer, B., Isaac, A. 2011. data.europeana.eu - The Europeana Linked Open Data Pilot. In Proceedings of the International Conference on Dublin Core and Metadata Applications (DC 2011).
[3] Kobilarov, G., Scott, T., Raimond, Y., Oliver, S., Sizemore, C., Smethurst, M., Bizer, C., Lee, R. 2009. Media Meets Semantic Web - How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009).
[4] Mitsopoulou, E., Taibi, D., Giordano, D., Dietze, S., Yu, H. Q., Bamidis, P., Bratsas, C. and Woodham, L. 2011. Connecting Medical Educational Resources to the Linked Data Cloud: the mEducator RDF Schema, Store and API. In Linked Learning 2011, Proceedings of the 1st International Workshop on eLearning Approaches for the Linked Data Age, CEUR-WS, Vol. 717.

3. LAK DATA CHALLENGE


Beyond merely publishing the data, we are actively encouraging its innovative use and exploitation as part of a public LAK Data Challenge¹⁷ sponsored by the European project LinkedUp. An initial competition is co-located with the ACM LAK13

12 http://www.r-project.org/
13 http://data.linkededucation.org/openrdf-sesame/repositories/lakconference?query=[your sparql query]
14 http://data.linkededucation.org/openrdf-sesame/repositories/lakconference?queryLn=SPARQL&query=PREFIX%20led%3A%3Chttp%3A%2F%2Fdata.linkededucation.org%2Fns%2Flinkededucation.rdf%23%3E%0APREFIX%20swrc%3A%3Chttp%3A%2F%2Fswrc.ontoware.org%2Fontology%23%3E%0A%0Aselect%20%3Fpaper%20%3Ffulltext%20where%20%7B%3Fpaper%20led%3Abody%20%3Ffulltext%20.%20%3Fpaper%20swrc%3Ayear%20%3Fyear%20.%20FILTER%20%28%3Fyear%20%3D%20%222011%22%29%20%7D&infer=true
15 http://data.linkededucation.org/openrdf-sesame/repositories/lakconference
16 http://www.solaresearch.org/resources/lak-dataset/sparql-queries/
17 http://www.solaresearch.org/events/lak/lak-data-challenge/
18 http://lakconference2013.wordpress.com/

A Dynamic Topic Model of Learning Analytics Research


Michael Derntl
RWTH Aachen University Advanced Community Information Systems (ACIS) Aachen, Germany

Nikou Günnemann
RWTH Aachen University Advanced Community Information Systems (ACIS) Aachen, Germany

Ralf Klamma
RWTH Aachen University Advanced Community Information Systems (ACIS) Aachen, Germany

derntl@dbis.rwth-aachen.de
nikou@dbis.rwth-aachen.de
klamma@dbis.rwth-aachen.de

ABSTRACT

Research on learning analytics and educational data mining has been published since the first conference on Educational Data Mining (EDM) in 2008 and gained momentum through the establishment of the Learning Analytics and Knowledge (LAK) conference in 2011. This paper addresses the LAK Data Challenge from the perspective of visual analytics of topic dynamics in the LAK Dataset between 2008 and 2012. The data set was processed using probabilistic, dynamic topic mining algorithms. To enable exploration and visual analysis of the resulting topic model by LAK researchers and stakeholders, we developed and deployed D-VITA, a web-based browsing tool for dynamic topic models. In this paper we explore answers to the questions about past, present, and future of LAK posed in the Data Challenge, based on a topic model of all papers in the LAK Dataset. We also briefly describe how users can explore the LAK topic model on their own using D-VITA.

Figure 1: Yearly distribution of papers over venues

questions posed in the LAK Data Challenge into the users' hands. D-VITA is a web-based tool that offers topic-based views on the LAK Dataset using a point-and-click metaphor and simple visualizations.

2. DATASET AND PREPROCESSING

1. OBJECTIVES

The LAK Data Challenge called for contributions to make sense of the field of learning analytics including its roots, current state, and future trends, based on how its members report and debate their research¹. This paper tackles the challenge by presenting facts obtained from statistical analyses of the paper full texts included in the provided LAK Dataset [7]. The main contributions are as follows:
1. A dynamic topic model was computed using the approach presented in [3]. Using this dynamic topic model we explore in Section 4 three questions about the evolution of topics in the LAK Dataset to distill knowledge about past, present and future of LAK research.
2. In Section 5 we describe the visual analytics application D-VITA², which puts the toolkit to answer the

The LAK Dataset underlying the analyses presented in this paper includes the EDM conference proceedings 2008-2012 (239 papers), the LAK conference proceedings 2011-2012 (66 papers), and the papers of the 2012 Special Issue on Learning Analytics in the Educational Technology and Society journal (10 papers; hereafter referred to as ETS). The RDF representation of the LAK Dataset was processed by a script that extracted for each paper the identifier, venue {LAK, EDM, ETS}, year of publication, title, authors, abstract, full text, and hyperlink to the full RDF description on data.linkededucation.org. The distribution of the 315 papers over time and venues is given in Figure 1. In the next preprocessing step the paper records were cleaned by removing stopwords and by applying stemming to the included word sets. For word stemming we used the Porter stemming technique [6], which is well established for this purpose. As a result, close to 5000 distinct word stems were identified as being used in the 315 papers.
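The shape of this preprocessing step can be sketched in a few lines of Python. The sketch is illustrative only: the stopword list is a tiny invented sample, and `crude_stem` is a deliberately simple suffix stripper standing in for the Porter algorithm [6], which has considerably more rules and would produce different stems in many cases.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "to",
                       "is", "are", "for", "on", "with"})
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix if at least three characters remain.

    This is a stand-in for illustration, NOT the Porter stemmer.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenise on letters, drop stopwords, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]


def vocabulary(papers):
    """Count stem occurrences over an iterable of paper texts."""
    counts = Counter()
    for text in papers:
        counts.update(preprocess(text))
    return counts
```

The number of distinct stems in a corpus is then `len(vocabulary(papers))`, which corresponds to the figure of close to 5000 stems reported for the 315 papers.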

1 http://solaresearch.org/events/lak/lak-data-challenge/
2 http://monet.informatik.rwth-aachen.de/DVita/?id=16

3. DYNAMIC TOPIC MINING

Copyright 2013 by the authors

From a text mining perspective the LAK Dataset represents a text corpus in which a set of words is used in a set of papers. To identify what is relevant to LAK research, we used the dynamic topic modeling approach described in [3] to obtain the distribution of words over a pre-defined number of topics. This is a probabilistic, unsupervised machine learning approach that has been gaining increasing prominence recently [2]. In these probabilistic topic models a topic is a distribution of words, so each topic is typically represented by its most frequently occurring words. Topic mining also obtains the distribution of these topics over the papers in the data set. Dynamic topic mining applies these analysis steps using several consecutive time slices in the data set. For the LAK Dataset, we chose the five calendar years {2008 ... 2012} as time slices. The results will thus reveal the evolution of topics over documents during those discrete time slices, and the evolution of words used in the papers for each topic over time.

Dynamic topic mining requires the analyst to pre-set the number of topics. Based on previous experiments with varying numbers of topics in paper collections in well-defined subject areas, we decided to run the analysis of the LAK Dataset with a set of 20 topics. This number, while somewhat arbitrary, should provide sufficient discriminatory power for both the distribution of topics over papers and the distribution of words over topics. With fewer topics, terms like "learning", for instance, are more likely to be present with relatively high relevance in many topics, while a larger preset would increase the number of topics exposed in each paper. Both situations would impede reasonable interpretation and visualization of the results.

A word of explanation regarding the labels used to refer to topics in this paper: mathematically each topic is a distribution over words. In a dynamic topic model this distribution changes over time, i.e. a specific word may rise or fall in relevance for a topic. In the rest of the paper we will therefore label each topic with an ordered tuple representing those words with the highest mean relevance for this topic over time. In the topic modeling literature we found that four words is a good number to form a topic label. For instance, for topic "students model parameters skill" the most relevant word on average is "student", followed by "model", "parameters", and "skill". For illustration, based on the word distribution for this topic in 2008 only, the label would be "model student skill learning". Often, such word tuples are rephrased as more expressive labels; for instance "student modeling" could be appropriate in our example.

The obtained topic model including 20 topics was analyzed to see whether the topics have sufficient discriminatory power. To this end, we used the ten most important words for each topic and the corresponding probability distributions to compute a dissimilarity measure of the distributions using the Jensen-Shannon divergence [5]. The matrix with pairwise divergence values is displayed in Figure 2. The maximum Jensen-Shannon divergence value is ln(2) ≈ 0.69. The darker the cell color, the lower the divergence, and thus the higher the similarity. The matrix is generally light-colored, indicating that the topics' word distributions diverge to a high degree. Topic pair (A, S) has the lowest dissimilarity value, and Figure 3 reveals why: both topics are about student modeling. Topic S generally appears to have several loosely related topics.
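The divergence measure and its upper bound of ln(2) can be checked with a short Python sketch. It assumes that both inputs are probability distributions over the same, identically ordered vocabulary and that natural logarithms are used; how the ten most important words per topic were aligned and normalised for Figure 2 is not described, so that step is not reproduced here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) with natural logarithm.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two distributions of equal length."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
```

Two distributions with no word in common, such as `[1, 0]` and `[0, 1]`, reach the maximum value ln(2), while identical distributions give 0. Applying `jensen_shannon` to every pair of topics yields a symmetric matrix of the kind shown in Figure 2.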

4. ANALYSIS OF LAK TOPIC DYNAMICS

In this section we explore three questions about the LAK Dataset, intending to shed some light on the past and present topics of learning analytics research, along with a cautious glimpse into the future.

Question 1: What have been the most relevant topics overall in the LAK data set?
This question addresses the LAK Data Challenge aspects of "roots" and "current state" of learning analytics. Figure 3 shows an overview chart of the 20 topics identified in the LAK Dataset. The horizontal axis reflects the rank of mean relevance of each topic and the vertical axis reflects the rank of stability³ over the five time slices in the dataset. The size of each bubble reflects the relevance of the topic in 2012, the most recent period. We make several observations:
- The most relevant topics most prominently feature the terms "students"/"learners", "model", and "data". This aligns well with SoLAR's definition of learning analytics as "the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs" [1], considering that understanding and optimization are necessarily based on models of learners and data.
- The topic with the highest mean relevance is "student model parameters skill" (A); this topic also has the highest variance in relevance.
- In the top-right quadrant we find topic "model data features prediction" (B), which has a strong relevance in 2012, a high mean relevance rank over all years and a high stability. As such, it can be considered one of the core topics in the LAK Dataset. In 2012 the distribution of words in this topic would advocate the label "prediction model data students", i.e. prediction is currently most relevant for this topic.
- Topic "network community discussion analysis" (R) is also worth looking at. While its overall relevance is relatively low and volatile, it is among the relevant topics in 2012 (cf. the bubble size). The topic evolution chart in Figure 4 reveals that this topic accumulated most of its relevance in 2011, the year of the first LAK conference. Also, 8 of the 10 papers with the strongest focus on this topic in 2011 were published in the LAK conference (see bottom portion of Figure 4), although EDM published 2.5 times the number of papers in that

Figure 2: Overview of topic divergence

3 Stability was computed by inverting the variance of the topic's relevance over time.

Figure 3: Topic stability plotted against average topic relevance over time.

the topic, i.e. the more documents expose this topic. Since each document exposes different topics to varying degrees, the relevance of topic k at time t is formally defined as $\mathrm{relevance}(k, t) := \frac{1}{|D_t|} \sum_{d \in D_t} \theta_d[k]$, where $D_t$ is the set of documents belonging to time t, and $\theta_d$ is the topic distribution for document d. Observing the ThemeRiver in Figure 5, it is evident that there were some shifts in topic focus between 2008 and 2010, where we have only the EDM papers in the dataset. Between 2010 and 2011 we identify the strongest turbulence, presumably based on substantial shifts in topic foci introduced by the 2011 LAK conference. Interestingly, the topic distribution remains rather stable during the last time slice, in which LAK 2012, EDM 2012 and the ETS special issue are included. This might suggest that these three publication venues propelled the convergence of LAK research as represented in the LAK Dataset. To see which topics rose in relevance between 2010 and 2011, we filter for topics and zoom into the transition between 2010 and 2011, as illustrated in Figure 6. Those three topics that have their absolute highest relevance in 2011 are marked with an up-pointing triangle with a solid-black outline. These are "model students data probability", "network community discussion analysis", and "problem students model types", indicating an increased focus on student modeling as well as community and network analysis through the first LAK conference in 2011.
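The relevance definition, i.e. the mean over the documents of a time slice of the weight each document gives to topic k, translates directly into code. The sketch below also includes one possible reading of the stability measure from the footnote to Figure 3, namely the reciprocal of the population variance of a topic's relevance series. Whether the population or sample variance was used, and how a zero variance was handled, is not stated in the paper, so both choices are assumptions, as are all names in the sketch.

```python
from statistics import pvariance


def relevance(theta_by_doc, docs_at_t, k):
    """Mean weight of topic k over the documents of one time slice.

    theta_by_doc maps a document id to its topic distribution (a list);
    docs_at_t is the non-empty list of document ids in the slice.
    """
    return sum(theta_by_doc[d][k] for d in docs_at_t) / len(docs_at_t)


def relevance_series(theta_by_doc, docs_by_time, k):
    """Relevance of topic k for every time slice, in chronological order."""
    return [relevance(theta_by_doc, docs_by_time[t], k)
            for t in sorted(docs_by_time)]


def stability(series):
    """Assumed reading of 'inverted variance': 1 / population variance."""
    variance = pvariance(series)
    return float("inf") if variance == 0 else 1.0 / variance
```

Ranking the 20 topics by the mean of `relevance_series` and by `stability` gives the two axes of a chart like Figure 3.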

Figure 4: Evolution (top) and most representative papers in 2011 (bottom) of topic "network community discussion analysis"

year. This topic, in 2012 represented by the word order "network community social user", therefore appears to be a genuine LAK topic which was previously rather irrelevant for the EDM conference.

Question 3: What topics rose the most in 2012, the most recent time slice in the data set?
This question looks into what the dynamic topic model of the LAK Dataset suggests as rising topics over the next year(s). We try to answer this by identifying those five topics that had the highest rise in relevance between 2011 and 2012. The topic labels represent the word distribution in 2012, and the number in parentheses indicates the absolute gain in relevance:

Question 2: What changes in topic dynamics did the first LAK conference in 2011 bring about?
This question aims to reveal whether and how the LAK community relates to the EDM community in terms of topics covered by their papers. To explore this we look (a) at the overall distribution of topics over time and (b) at the relative change of topic relevance between 2010 and 2011. The evolution of the overall distribution of topics is illustrated in the ThemeRiver in Figure 5. In a ThemeRiver [4] the horizontal axis represents the points in time to which the documents in a dataset belong (in the LAK Dataset that is the publication date), and the vertical axis represents the relevance of the topic. Each current in the ThemeRiver therefore presents the dynamic development of a selected topic over time. The wider the current, the more relevant is

Figure 5: Overall distribution of topic relevance between 2008 and 2012

The Document and Word Evolution Panel shows, for the selected topic, an ordered list of the most relevant papers in the "Relevant Documents" tab. The icons next to each document allow showing the topic pie for the document and its content, respectively. The "Similar Docs" icon will bring up the Document Browser with a list of similar documents. Under the "Word Evolution" tab the user will find a ThemeRiver illustrating the evolution of the distribution of words in the selected topic over time. D-VITA also offers a Document Browser to perform keyword-based search, explore the topic distribution of documents, and navigate documents based on similarity.

Figure 6: Topics with rising relevance in 2011

1. students data courses system (+.054)
2. students interaction participants analysis (+.036)
3. learning analytics social learners (+.035)
4. students actions learning state (+.025)
5. data user learning dataset (+.013)

6. CONCLUSION

In a nutshell, we discovered the following:
- Regarding the past, we found that LAK and EDM do have a substantial shared topic foundation including themes like student modeling, data classification, and clustering. We also found that the EDM conference series had some turbulence in topical focus between 2008 and 2010, the time window when only EDM papers are present in the dataset.
- Regarding the present, we found that the LAK Dataset exposes a strong emphasis on learner modeling, data modeling, analysis and prediction. The first LAK conference in 2011 also brought some considerable shifts in topic focus; e.g. LAK 2011 has visibly strengthened network and social analysis aspects on top of EDM topics.
- Regarding the near future, we found that the shifts in the topics' proportions in 2012 appear rather moderate, thus indicating a phase of convergence of LAK research topics. Projecting recent topic shifts into the future, we can expect increased emphasis on social and interaction aspects and a sustained, strong role of students as research subjects.

In sum, these five topics have accumulated a share of 42% of the topic distribution by 2012, starting from 11% in 2008 (cf. Figure 7). These developments indicate a strong increase in focus on the students' activities and actions in courses as well as social and interaction analytics.
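Selecting the topics with the highest rise between two time slices, and summing their share in a given year, amounts to a sort and a sum. The following sketch assumes a hypothetical data layout in which each topic label maps to a dictionary from year to relevance; the function names and this layout are illustrative and not taken from D-VITA.

```python
def top_rising(relevance_by_topic, earlier, later, n=5):
    """Return the n (label, gain) pairs with the largest gain in relevance.

    relevance_by_topic maps a topic label to a {year: relevance} dict.
    """
    gains = {label: series[later] - series[earlier]
             for label, series in relevance_by_topic.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)[:n]


def cumulative_share(relevance_by_topic, labels, year):
    """Sum of the relevance values of the given topics in one year."""
    return sum(relevance_by_topic[label][year] for label in labels)
```

With the topic model's relevance values as input, `top_rising(data, 2011, 2012)` would give a ranking of the kind listed for Question 3, and `cumulative_share` evaluated per year would give the curve of Figure 7.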

Figure 7: Cumulative relevance of the top-five rising topics 2012 over all years

7. ACKNOWLEDGMENTS

5. D-VITA TOPIC ANALYTICS TOOLKIT

This work was supported by the European Commission through the support action TEL-Map (FP7-257822) and the integrated project Layers (FP7-318209).

Except for Figures 1 and 3, all figures were produced using D-VITA, a web-based visual analytics tool we developed and deployed for visual analytics of dynamic topic models. The tool allows users to visually interact with the output of the dynamic topic mining algorithms on the LAK Dataset. The application window shown in Figure 8 has three panels:

The Topics Panel shows the list of topics obtained by the dynamic topic modeling algorithm; topics can be sorted by rising, falling and mean relevance, as well as variance of relevance. The topics can be filtered using keywords; in the screenshot the keyword "visual" is used as a filter. The topic list thus only includes topics whose set of relevant words includes this word stem. Topics checked by the user will be visualized in the ThemeRiver in the Topic Evolution Panel.

The Topic Evolution Panel shows a ThemeRiver of the evolution of relevance of the topics selected in the Topics Panel. Data points for each topic and time slice can be clicked, which triggers the display of detailed information on the clicked topic at the selected time slice in the Document and Word Evolution Panel.

8. REFERENCES

[1] About SoLAR, 2012. http://www.solaresearch.org/mission/about/.
[2] D. M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77-84, 2012.
[3] D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, pages 113-120, 2006.
[4] S. Havre, E. G. Hetzler, P. Whitney, and L. T. Nowell. ThemeRiver: Visualizing thematic changes in large document collections. IEEE Trans. Vis. Comput. Graph., 8(1):9-20, 2002.
[5] J. Lin. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37(1), 1991.
[6] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[7] D. Taibi and S. Dietze. Fostering analytics on learning analytics research: the LAK dataset. Technical Report, 03/2013, 2013. http://resources.linkededucation.org/2013/03/lak-dataset-taibi.pdf.

Figure 8: Application window showing ThemeRiver and document list (rotated image)

Socio-semantic Networks of Research Publications in the Learning Analytics Community


Soude Fazeli, Hendrik Drachsler, Peter Sloep
Open University of the Netherlands (OUNL) Centre for Learning Sciences and Technologies (CELSTEC) 6401 DL Heerlen, The Netherlands 0031-(0)45-576-2218

{soude.fazeli,hendrik.drachsler,peter.sloep}@ou.nl

ABSTRACT
In this paper, we present network visualizations and an analysis of publication data from the LAK (Learning Analytics and Knowledge) conference in 2011 and 2012, and the special edition on Learning and Knowledge Analytics in the Journal of Educational Technology and Society (JETS) in 2012.

2. Motivation
It is often difficult for conference attendees to decide which workshops or sessions are suitable and relevant for them. Therefore, a list of recommended authors and papers based on shared interests could help them plan their conference participation more efficiently and effectively. Several papers have already been published on awareness support for researchers (Reinhardt et al., 2012; Fisichella et al., 2010; Ochoa et al., 2009; Henry et al., 2009) and on scientific recommender systems (Huang et al., 2002; Wang & Blei, 2010), but none of them has analyzed the Learning Analytics datasets for this purpose yet. Our overall vision is to support the LAK attendees with a list of LAK authors and papers that are relevant for their own research interests. Such a recommendation could be created based on one or more of their own research papers, but also on a short essay or even a tag cloud summarizing their research interests and objectives. Such a priority list can support the awareness of the attendees and strengthen the network of like-minded authors in the attendees' particular research focus.

Categories and Subject Descriptors


H.3.3 [Information Search and Retrieval]: Information filtering; K.3.m [computers and education]: Miscellaneous

General Terms
Algorithm, visualizations

Keywords
Network, recommender, visualization, dataset, learning analytics, degree

1. Introduction

The Society for Learning Analytics Research (SoLAR)1 provided a dataset to solicit contributions to the LAK data challenge2 sponsored by the FP7 European project LinkedUp3. The dataset contains research publications in learning analytics and educational data mining for the years 2010, 2011, and 2012 (Taibi & Dietze, 2013). An overview of the dataset is shown in Figure 1. The dataset contains, in total, 173 authors and 76 papers from the LAK (Learning Analytics and Knowledge) conference series in 2011 and 2012, and from the special edition on learning and knowledge analytics in the Journal of Educational Technology and Society (JETS) in 2012. We found 24 authors who contributed to all three scientific proceedings. Having access to a dataset always offers new opportunities, particularly in the educational domain, which lacks public datasets for running experimental studies (Verbert, Drachsler, Manouselis, Wolpers, Vuorikari, & Duval, 2011). Therefore, we used this dataset to visualize the network of authors and papers, and to carry out a deeper analysis of the generated networks. Our overall aim is to use such a graph of authors and papers to recommend similar items to a target user. In the following sections, we evaluate the suitability of the LAK dataset for this purpose.
Figure 1. The used datasets

In this paper, then, we aim to explore and identify like-minded authors within the LAK dataset. Supposing that we have a network of all the LAK authors and papers, the main research questions are:
RQ1. How are the authors connected, and which authors share more connections and are more central in terms of sharing commonalities with the others?
RQ2. How are the papers connected to each other in terms of similarity?
To answer these questions, we went through two main steps in our analysis: 1. Finding patterns of similarity between authors and papers, 2. Visualizing networks of the LAK authors and papers. We will now describe each step in detail.

1 http://www.solaresearch.org/
2 http://www.solaresearch.org/events/lak/lak-data-challenge/
3 http://linkedup-project.eu/

4.1. The LAK authors network


Figure 2 presents a network of the LAK authors in which red nodes represent the authors and the edges show the similarity between the publications of two authors. The result shows how the LAK authors are connected in terms of their publications' commonalities. Moreover, the network reveals the authors who share more commonalities than other authors do. We call them central authors. In the next section, we show how they are connected with the other authors in the network.

4.2. The LAK authors degree centrality


For a given node in the network, the degree centrality is the total number of its incoming and outgoing edges. It is a metric commonly used in Social Network Analysis (SNA) (De Liddo, Buckingham Shum, Quinto, Bachler, & Cannavacciuolo, 2011; Gueret, Groth, Stadler, & Lehmann, 2012; Opsahl, Agneessens, & Skvoretz, 2010). In other words, the degree of a node describes how many other nodes are connected to the target node. It thus helps to measure how many hubs are in the network, where hubs are the nodes that have the most connections to the others. The degree centrality metric may also be used to strengthen a network by providing its nodes with more connections. In this data study, degree centrality is used to measure the relevance of an author's papers to the other authors in the network.
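As a minimal sketch of the metric described above, the following Python snippet counts the incoming and outgoing edges of each node in a small directed graph. The edge list and the author labels (u1 to u4) are hypothetical and are not taken from the LAK dataset.

```python
from collections import Counter


def degree_centrality(edges):
    """Return {node: in-degree + out-degree} for a list of directed edges."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1  # outgoing edge of the source node
        degree[target] += 1  # incoming edge of the target node
    return dict(degree)


# Hypothetical similarity edges between authors (not real LAK data).
edges = [("u1", "u2"), ("u3", "u1"), ("u4", "u1"), ("u2", "u3")]
degrees = degree_centrality(edges)

# Nodes ordered from the most to the least connected; the first ones are hubs.
hubs = sorted(degrees, key=degrees.get, reverse=True)
```

Sorting the nodes by degree, as in the last line, is one simple way to read off the hubs of the network.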

Figure 2. The LAK authors network


(The Appendix shows a larger version)

3. Data processing
To find relationships between authors, we first computed the similarity of the papers with the TF-IDF4 algorithm. TF-IDF creates a weighted list of the most commonly used terms in research articles. To generate the TF-IDF matrix for the LAK dataset, we first converted the LAK data from RDF to text files, which is an accepted input format for the Mahout5 system. Then, we ran the default TF-IDF algorithm provided by Mahout on the text files. We removed stop words by setting the corresponding configuration variable within Mahout to 90%. Thus, if a word appears in 90% of the documents, it is considered a stop word (e.g. and, or, the) and is removed from the similarity matrix. As a final outcome we had (i) a so-called dictionary of all the terms in the LAK dataset and (ii) a binary sequence file that includes the TF-IDF weighted vectors.
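The weighting and the stop-word cut described above can be sketched in plain Python. This is an illustration only, not Mahout's implementation: the function name `tfidf`, the `max_df` parameter, the textbook weighting tf x log(N/df), and the three toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf(docs, max_df=0.9):
    """Compute TF-IDF vectors, dropping terms found in >= max_df of the docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # The dictionary keeps only terms below the document-frequency cut.
    dictionary = sorted(t for t, c in df.items() if c / n_docs < max_df)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            t: (counts[t] / len(tokens)) * math.log(n_docs / df[t])
            for t in dictionary if t in counts
        })
    return dictionary, vectors


# Hypothetical toy corpus (not LAK data).
docs = [
    "the analysis of learning data",
    "the network of authors",
    "the topics of papers and the data",
]
dictionary, vectors = tfidf(docs)
```

In this toy corpus the words "the" and "of" occur in every document, so they fall above the 90% cut and are left out of the dictionary, mirroring the two outputs named above (a dictionary and a set of weighted vectors).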


To compute the similarity between the LAK authors, we used the T-index algorithm (Fazeli, Zarghami, Dokoohaki, & Matskin, 2010), a collaborative filtering recommender algorithm that generates a graph of users. In this graph, the nodes are users and the edges show the relationships between users that originate from the similarity of their user profiles. The T-index algorithm originally makes recommendations based on the ratings data of users. We extended the T-index algorithm to be able to process tags and keywords extracted from linked data, e.g. RDF files. We used the Jena6 APIs to process RDF files and to handle the Web Ontology Language (OWL) files that describe the generated graph of authors and papers. Jena supports the development of Semantic Web applications and tools.


Figure 3. The degree centrality of the top ten central authors

Figure 3 shows the degree centrality for the ten authors with the highest similarity degree with respect to the LAK publications. The horizontal axis (x) shows the top ten central authors, e.g. u1 is the author whose paper(s) has the highest degree. The vertical axis (y) shows the degree values, i.e. the number of relationships of each author shown on the x-axis. Figure 3 also shows the degree centrality for two different sizes of nearest neighborhoods (n). Such neighborhoods are commonly used in collaborative filtering recommender algorithms. As the neighborhood size n increases, the degree of the authors increases accordingly. As a result, we obtain a larger number of central authors when n is higher (e.g. n=10). As can be seen in Figure 3, the degree of the first central author (u1) is equal to 121 if n=10 and 97 if n=5. These high scores show the high relevancy of u1's publications to the other authors. As a consequence, u1 will appear in the top-n author recommendations more often than the other authors.
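The effect of the neighborhood size on the degree can be illustrated with a small nearest-neighbor graph. The sketch below is not the T-index algorithm; it uses plain cosine similarity over keyword profiles as a stand-in, and the profiles, author labels, and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighbor_graph(profiles, n):
    """Link every author to the n most similar other authors."""
    edges = set()
    for author, profile in profiles.items():
        ranked = sorted(
            (other for other in profiles if other != author),
            key=lambda other: cosine(profile, profiles[other]),
            reverse=True,
        )
        edges.update((author, other) for other in ranked[:n])
    return edges


def degree(edges, node):
    """Total number of incoming and outgoing edges of a node."""
    return sum(node in edge for edge in edges)


# Hypothetical keyword profiles (not real LAK data).
profiles = {
    "u1": {"learning": 1.0, "analytics": 1.0, "network": 1.0},
    "u2": {"learning": 1.0, "analytics": 1.0},
    "u3": {"network": 1.0, "analytics": 1.0},
    "u4": {"learning": 1.0, "visualization": 1.0},
}
degree_n1 = degree(neighbor_graph(profiles, 1), "u1")
degree_n2 = degree(neighbor_graph(profiles, 2), "u1")
```

With these toy profiles, the degree of u1 is larger for n=2 than for n=1, which is the same qualitative behavior reported for n=5 and n=10.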

4. Data visualization
We visualized the generated graphs of authors and papers with the Welkin7 tool. Welkin takes an OWL file as input and produces a visualization of the data as output. We present the visualizations of the LAK authors and the LAK papers generated by Welkin in the following subsections.

4 http://en.wikipedia.org/wiki/Tfidf
5 http://mahout.apache.org/
6 http://jena.apache.org/
7 http://simile.mit.edu/welkin/

5. Discussion and conclusions


The results presented here allow us to answer our research questions in the following way:

RQ1. How are the authors connected? Which authors share more connections and are more central in terms of sharing commonalities with the others?

We presented a visualization of the authors network to provide an overview of how the authors are connected to each other. To substantiate the authors' connections and relationships, we evaluated the degree centrality of the ten most central authors. Table 1 presents these ten authors and their degree values, showing whose publications are most relevant to those of the other authors in the network. The degrees in Table 1 are given for a neighborhood size of 10.

Table 1. The first ten central authors

Author | Degree
Hendrik Drachsler | 116
Kon Shing Kenneth Chung | 87
Wolfgang Greller | 80
Javier Melenchon | 66
Brandon White | 59
Vania Dimitrova | 50
Erik Duval | 45
Rebecca Ferguson | 44
Anna Lea Dyckhoff | 40
Simon Buckingham Shum | 39

Figure 4. The LAK papers network (The Appendix shows a larger version)

4.3. The LAK papers network

Figure 4 shows a network of the LAK papers. The red nodes are papers and the edges between them represent the similarity of the papers. By finding similar papers, we can recommend the most similar papers to specific authors. This increases the authors' awareness of papers that are relevant to them and published in their communities. Figure 4 shows that some papers share more similarity with the others and thus have a higher degree. As with the central authors, these papers will appear more often in the top recommendation list than the other papers of the dataset. One may interpret their degree as their popularity. Therefore, the papers with higher degree values are more popular and, presumably, of more interest to users. For the publication data, the interests of users are derived from the words and terms they use most frequently in their papers.


Figure 5. The degree centrality of Top ten papers

RQ2. How are the papers connected to each other in terms of similarity?

We presented the degree centrality of the LAK papers to give insight into their relationships in the visualized papers network. We selected the ten papers that have the highest similarity with the other papers. To show which papers are in this top ten list, we present the title and authors of each paper. The top ten papers are not necessarily by the authors who are identified as the central authors. Although most of the central authors also appear in the top ten papers list (see Table 2), the order is not the same. When we investigated the LAK data, we found that some of the central authors have more than one paper. For instance, Hendrik Drachsler has contributed to four papers. In this study, similarity is calculated based on all papers of an author. So, it is quite probable that not every one of an author's papers individually has the highest similarity to the other papers. Although some of the central authors are common to the two tables, only one of the papers authored by those central authors appears in the top ten papers list shown in Table 2.

4.4. The LAK papers degree centrality

Figure 5 shows the degree centrality for the ten papers that are most similar to the other papers, i.e. the ten papers with the highest degrees. The horizontal axis (x) shows the top ten papers, e.g. p1 is the paper with the highest similarity and thus the highest degree value among the others, as shown by the vertical axis (y). Figure 5 shows the degree centrality for two different sizes of nearest neighborhoods (n), 5 and 10. As n increases, the degree of the papers increases accordingly. As a result, we obtain a larger number of top papers if n is higher (here, when n=10). In Figure 5, the degree of the first top paper (p1) is equal to 53 (n=10) and 29 (n=5). This shows how much similarity p1 shares with the other papers. As a consequence, p1 can be considered the most popular paper, and it has the highest chance of appearing in the top paper recommendations.

Table 2. The top ten papers

Paper | Authors
Learning Dispositions and Transferable Competencies: Pedagogy, Modelling and Learning Analytics | Simon Buckingham-Shum, Ruth Deakin Crick
The Pulse of Learning Analytics: Understandings and Expectations from the Stakeholders | Hendrik Drachsler, Wolfgang Greller
Social Learning Analytics: Five Approaches | Rebecca Ferguson, Simon Buckingham-Shum
Multi-mediated Community Structure in a Socio-Technical Network | Dan Suthers, Kar Hai Chu
Modelling Learning & Performance: A Social Networks Perspective | Walter Christian Paredes, Kon Shing Kenneth Chung
Teaching Analytics: A Clustering and Triangulation Study of Digital Library User Data | Beijie Xu, Mimi M Recker
Monitoring Student Progress Through Their Written "Point of Originality" | Johann Ari Larusson, Brandon White
Learning Designs and Learning Analytics | Lori Lockyer, Shane Dawson
A Multidimensional Analysis Tool for Visualizing Online Interactions | Eunchul Lee, M'hammed Abdous
Using computational methods to discover student science conceptions in interview data | Bruce Sherin

Overall, we found that the LAK dataset can help conference attendees to become more aware of their research network, which, in turn, is useful for sharing knowledge and experiences. However, the current dataset contains no user feedback or ratings that could be used to evaluate either an author or a paper recommender system in terms of common metrics such as prediction accuracy and coverage of the generated recommendations. For future analysis, it would be helpful if the LAK dataset also contained the references of the papers. The references could be used to identify the top cited authors and papers within the LAK dataset and beyond. As a further step, we are planning to try additional social network analysis measures besides degree, such as betweenness or closeness.

6. References

De Liddo, A., Buckingham Shum, S., Quinto, I., Bachler, M., & Cannavacciuolo, L. (2011). Discourse-centric learning analytics. LAK 2011: 1st International Conference on Learning Analytics & Knowledge. Banff, Alberta.
Fazeli, S., Zarghami, A., Dokoohaki, N., & Matskin, M. (2010). Elevating Prediction Accuracy in Trust-aware Collaborative Filtering Recommenders through T-index Metric and TopTrustee lists. Journal of Emerging Technologies in Web Intelligence, 2(4), 300-309. doi:10.4304/jetwi.2.4.300-309
Gueret, C., Groth, P., Stadler, C., & Lehmann, J. (2012). Assessing Linked Data Mappings using Network Measures. Proceedings of the 9th international conference on The Semantic Web: research and applications (pp. 87-102). Springer-Verlag Berlin, Heidelberg. doi:10.1007/978-3-642-30284-8_13
Opsahl, T., Agneessens, F., & Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3), 245-251. doi:10.1016/j.socnet.2010.03.006
Taibi, D., & Dietze, S. (2013). Fostering analytics on learning analytics research: the LAK dataset.
Verbert, K., Drachsler, H., Manouselis, N., Wolpers, M., Vuorikari, R., & Duval, E. (2011). Dataset-driven research for improving recommender systems for learning. Proceedings of the 1st International Conference on Learning Analytics and Knowledge (pp. 44-53). ACM, New York, NY, USA.

7. Appendix
7.1. The LAK authors network

7.2. The LAK papers network

Paperista: Visual Exploration of Semantically Annotated Research Papers


Nikola Milikic
Faculty of Organizational Sciences, University of Belgrade Jove Ilića 154 Belgrade 11000, Serbia +381-11-3950853
nikola.milikic@gmail.com

Uros Krcadinac
Faculty of Organizational Sciences, University of Belgrade Jove Ilića 154 Belgrade 11000, Serbia +381-11-3950853
uros@krcadinac.com

Jelena Jovanovic
Faculty of Organizational Sciences, University of Belgrade Jove Ilića 154 Belgrade 11000, Serbia +381-11-3950853
jeljov@gmail.com

Bojan Brankov
UZROK Labs 107 Nehruova Belgrade 10070, Serbia +381-63-581879
bb@uzrok.com

Srdjan Keca
UZROK Labs 107 Nehruova Belgrade 10070, Serbia +381-61-3115661
sk@uzrok.com

ABSTRACT
We consider the problem of visualizing and exploring a dataset about research publications from the fields of Learning Analytics (LA) and Educational Data Mining (EDM). Our approach is based on semantic annotation that associates publications from the dataset with Wikipedia topics. We present a visualization and exploration tool, called Paperista (www.uzrok.com/paperista), which presents these topics in the form of bubble and line charts. The tool provides multiple views, thus allowing users to observe and interact with topics, understand their evolution and relationships over time, and compare data originating from different research fields (i.e., LA and EDM). Moreover, users can explore the papers to which the presented topics are related, and run related Web searches to access the papers themselves.

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools and Techniques - user interfaces

General Terms
Algorithms, Design

Keywords
Learning Analytics, Visualization, Research Papers

1. MOTIVATION
The field of Learning Analytics has been emerging over the past few years and is attracting more and more researchers from other areas of Technology Enhanced Learning (TEL). It aims to address the current needs in the broad area of education by making use of the latest trends in information technologies, where everything is moving towards Big Data and real-time analytics. Learning Analytics (LA) is defined as "the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs" [16]. It is often equated with other similar fields in the TEL area, such as Academic Analytics or Educational Data Mining (EDM) [14]. EDM is a research field that focuses on using computational approaches, namely data mining and machine learning, to analyze educational data in order to facilitate and enhance the educational process, and to contribute to the overall improvement of students' learning experience [17].

Even though both LA and EDM are self-contained research fields, they are intertwined and overlap in the topics they cover. They share many similarities, but also have some distinct differences, as discussed by Siemens and Baker [18]. One of the similarities emphasized by these authors is that both fields reflect the emergence of data-intensive approaches to education, where both communities have the goal of analyzing large-scale educational data in order to support research and practice in education. They differ in the level of automation they aim to achieve. In particular, EDM has a greater focus on automating support for educational processes, such as adaptation and personalization of learning environments and learning processes. LA, on the other hand, has a considerably greater focus on leveraging human judgment, and on informing and empowering instructors and learners to reflect on and improve learning processes.

The Society for Learning Analytics Research (SoLAR) has published the LAK dataset1, containing structured data about research publications from the Learning Analytics and Knowledge (LAK) Conference, the Educational Data Mining Conference, and the Journal of Educational Technology & Society (JETS) Special Issue on LAK. The data are represented in RDF, which makes them easy for applications to integrate and process. In this paper, we propose an approach to visualizing and exploring the LAK dataset. It is centered around the topics covered by the papers in the dataset, and is intended to give an overall view of the topics that the LA and EDM fields cover. As the focus of researchers and the degree of relevance of particular topics have been changing over the years, our approach tries to show the trend of those changes through the whole period the dataset covers, namely from 2008 to 2012. It also allows for topic-based exploration of research papers and easy navigation to them.

2. RELATED WORK
In [11], the authors present an interesting work aimed at automating the creation of relations between research areas by using semantically annotated data about research papers in a particular area. As a continuation of this work, the same authors have created a tool, called Rexplore2, which, among other things, visualizes authors' migration patterns across research areas [15]. In terms of visual representation, we find interesting the approaches to visualizing tags (topics) and categories of tags over time. For example, Dubinko et al. [4] consider the problem of visualizing the evolution of Flickr tags. The authors present a new slider-based approach based on a characterization of the most interesting tags. A Flash-based animation in a web browser allows the user to observe and interact with the tags. Zhang et al. [5] present an approach to classification and visualization of temporal and geographic tag distributions. The authors argue that their approach can help humans recognize semantic relationships between tags. Lemma [6] presents the Ebony system, an application for browsing, navigation, and visualization of the DBLP database. Wattenberg [7] introduces arc diagrams for representing complex patterns of repetition in string data. Wattenberg's application, the Shape of Song, visualizes music files, creating a static representation of repetition throughout a time series. However, to our knowledge, there has been no (published) research work on the visualization of research topics and publications in the areas of LA and EDM.

1 www.solaresearch.org/resources/lak-dataset

3. THE PAPERISTA SYSTEM


Our approach is illustrated through a Web application called Paperista. The application visualizes topics associated with research publications from the LAK dataset, allowing users to browse through papers, compare LA and EDM research fields, and make related Web searches. Visualizations are created for each individual year in order to display relevant topics in the LA and EDM fields for a specific year, but also for all years combined in order to give an overall depiction of the topic distribution in these research areas.

3.1 Data Preparation and Analysis

The LAK dataset consists of data about conference and journal papers published in the LA and EDM research fields in the 2008-2012 period. For each paper, the following elements are available: title, author(s), abstract, keyword(s) and full text. Basic information about authors, such as name and affiliation, is also available.

3.1.1 Topic Extraction

Since one of the main features of Paperista is the visualization of research topics relevant for the given corpus, the first step in the data preparation process was to extract the main topics of the papers in the LAK dataset. A straightforward approach would be to use the keywords associated with the papers, because the authors themselves have compiled those keywords, and they know best which topics describe their work most appropriately. However, the downside of this approach is that the keywords are given as free-form text and do not follow any existing formal vocabulary, which makes them inconsistent throughout the corpus. Furthermore, the dataset is incomplete with regard to keywords, as no keywords are provided for the EDM 2008, 2009 and 2010 conferences. Thus, we decided to employ a service for semantic annotation in order to detect paper topics. We took into consideration two Wikipedia-based semantic annotators: TagMe3 and DBpedia Spotlight4. The decision to use a Wikipedia-based annotator was motivated by the fact that Wikipedia is the largest corpus of open encyclopedic knowledge and is often used as a well-established large-scale taxonomy [8].

Both annotator services are designed to look for and retrieve recognized Wikipedia concepts from the given text. They can be configured to the specific needs of any particular usage scenario (i.e., corpus). TagMe is designed to identify Wikipedia concepts specifically in short texts. Its REST API5 allows for configuration of two parameters: i) the rho parameter, which refers to the "goodness" of an annotation with respect to the topics of the input text, and ii) the epsilon parameter, which is used for fine-tuning the disambiguation process and indicates whether to favor the most common topics or to take the context more into account [9]. DBpedia Spotlight annotates a given text with concepts from DBpedia, a structured representation of Wikipedia [12]. The DBpedia Spotlight REST API6 exposes two parameters: i) the confidence parameter, which takes into account factors such as the topical pertinence and the contextual ambiguity, and ii) the support parameter, which specifies the minimum number of inlinks7 [10].

We used only the paper title and abstract for topic extraction, based on the assumption that these two elements contain mentions of the most important and interesting topics a paper is related to. In order to decide which semantic annotation service to use, the two services were tested with a random sample comprising 5% of all papers and with different parameter settings. The best results were achieved by the TagMe service (rho=0.15; epsilon=0.5). For this reason, the TagMe service was employed to annotate all papers in the corpus.

3.1.2 Identifying Popular Topics

Once all papers had been associated with topics, we calculated the significance of each topic. A numerical statistic called TF-IDF (Term Frequency - Inverse Document Frequency)8 was used, as it calculates how important a word is to a document in a corpus of documents. This metric was adapted to our case and used to calculate the importance of a topic in a paper: instead of calculating the frequency of a word, we calculate the frequency of a topic. Since Paperista allows for visualizing topics in a specific year and overall (in all years, 2008-2012), the significance was calculated for corpora containing papers from each of these different time periods. Accordingly, we had six different corpora and calculated the significance of a topic for each corpus. In order to present only the most significant topics, we filtered the topic set to only those whose significance for a particular period was over 0.01. This threshold was chosen empirically and gives the best balance between the relevance of topics and their presentation in Paperista's visualizations (i.e., assuring easy comprehension by users).

3.1.3 Topic Cleaning

Even though the output of the TagMe service consisted of topics that are relevant to the papers' content, some of them can hardly be considered relevant research topics in the LA and EDM fields, as they are too general. For instance, topics like Methodology, Research, and Experiment can be associated with almost every paper in this corpus. In fact, these topics can be related to research papers from almost any other research area. In addition, some of the retrieved topics were not relevant to research papers from the LAK dataset at all. Such topics resulted from imperfections of the TagMe tool (and of semantic annotation tools in general). Some examples of these alien topics include The T.O. Show, Ade Easily, and Henry Snapp. For instance, the Henry Snapp topic was apparently confused with the SNAPP tool9, a popular learning analytics and visualization tool. Hence, it was important to detect and exclude all these generic and alien topics from the final visualization in order to reduce the noise.

We applied a topic cleaning approach similar to [11]. The idea is to identify topics that have little or no relationship with other topics in the corpus. This can be an indicator that a topic is too specific or alien to our set of identified topics and thus can be considered an exclusion candidate. On the other hand, if a topic has relationships with too many other topics, this can be an indicator that the topic is too generic, and it should again be considered an exclusion candidate. In order to detect these outlier topics, we needed a measure of relatedness between topics. To that end, we used the Wikipedia Miner10 service, which calculates the semantic relatedness of two topics by finding the corresponding Wikipedia articles and calculating the similarity of those articles by comparing their incoming and outgoing links [13]. Wikipedia Miner has a REST API11 that allows for retrieving this information programmatically.

2 http://technologies.kmi.open.ac.uk/rexplore
3 http://tagme.di.unipi.it
4 http://spotlight.dbpedia.org
5 http://tagme.di.unipi.it/tagme_help.html
6 http://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Webservice
7 An inlink, or incoming link, is a link from another DBpedia concept to the observed DBpedia concept
8 http://en.wikipedia.org/wiki/Tf-idf

Once relatedness had been calculated for all the topics in our corpus, we compiled two lists to help us detect removal candidates. In the first list, each topic was associated with the number of other topics it is related to. This gave us an insight into which topics can be considered too general or too specific (the higher the number of related topics, the more generic the topic is, and vice versa). In the second list, each topic was associated with the sum of its relatedness to all the other topics. This list was meant to complement the first one. The rationale here is that there might be a topic with a fair number of relations to other topics, but with weak relatedness values. Such a topic also qualifies as too specific or alien. The initial idea behind compiling these two lists was that the topics to be removed would be at the beginning and the end of the lists (top and bottom 10%), and that they could be removed automatically. However, on examining the lists, we found that, alongside the obvious exclusion candidates, there were also several topics that should not be excluded. For instance, topics like Online tutoring, Process mining, and Educational data mining were at the end of both lists, making them removal candidates, even though these topics are obviously highly relevant for the LA and EDM fields. The reason for this lies in the nature of Wikipedia itself and the fact that not many other articles in Wikipedia link to these topics. Thus, the topic removal process could not be done completely automatically, and an expert in the area was consulted to mark the topics that should not be excluded.
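The two-list procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the relatedness scores are invented for the example rather than retrieved from Wikipedia Miner, a topic counts as related to another when their score is above zero, and the names `exclusion_candidates` and `expert_keep` are hypothetical.

```python
def exclusion_candidates(relatedness, fraction=0.1):
    """Flag topics in the top and bottom `fraction` of the two rankings."""
    topics = sorted({t for pair in relatedness for t in pair})
    related_count = dict.fromkeys(topics, 0)   # list 1: number of related topics
    related_sum = dict.fromkeys(topics, 0.0)   # list 2: summed relatedness
    for (a, b), score in relatedness.items():
        if score > 0:
            for t in (a, b):
                related_count[t] += 1
                related_sum[t] += score
    k = max(1, int(len(topics) * fraction))
    candidates = set()
    for ranking in (related_count, related_sum):
        ordered = sorted(topics, key=ranking.get)
        candidates.update(ordered[:k])   # weakly connected: too specific or alien
        candidates.update(ordered[-k:])  # strongly connected: too generic
    return topics, candidates


# Hypothetical pairwise relatedness scores (not Wikipedia Miner output).
relatedness = {
    ("Research", "Data mining"): 0.5,
    ("Research", "Machine learning"): 0.5,
    ("Research", "Educational data mining"): 0.6,
    ("Research", "Henry Snapp"): 0.05,
    ("Data mining", "Machine learning"): 0.8,
    ("Data mining", "Henry Snapp"): 0.1,
}
topics, candidates = exclusion_candidates(relatedness)

# Topics an expert marked as relevant despite being flagged.
expert_keep = {"Educational data mining"}
cleaned = [t for t in topics if t not in candidates or t in expert_keep]
```

In this toy example the first list flags a relevant but sparsely linked topic, while the second list catches a topic with several weak relations, which is why a manual whitelist is applied before the final filtering.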

Figure 1 - Paperista Interface


9 http://www.snappvis.org
10 http://wikipedia-miner.cms.waikato.ac.nz
11 http://wikipedia-miner.cms.waikato.ac.nz/services

3.2 Data Visualization and Exploration


The topic visualization applied in Paperista is inspired by the New York Times visualizations "Four Ways to Slice Obama's 2013 Budget Proposal" [1] and "At the National Conventions, the Words They Used" [2]. The Paperista visualization includes bubble and line charts, allowing users to gain insights into topic trends within the LA and EDM fields. Bubble charts show the importance of a certain topic for the entire dataset, each year, and/or each field. By switching between views, users can watch the changes within the dataset and compare the two fields. Animated transitions between charts help users understand these processes. In addition, since the animation does not show precise changes in a topic's relevancy (calculated using the TF-IDF metric, see Sect. 3.1), users are also presented with a relevancy line chart for each topic.

The user interface (Figure 1) consists of an animated bubble cloud, two slider buttons, a sidebar, and an optional timeline. The first slider button (All Years / By Year) allows users to choose between the All Years and By Year views. The All Years view presents relevant topics for the entire corpus of publications. The By Year view activates a timeline, showing relevant topics for each year. By using the slider, users can follow the change in topic relevancy through the years for which data is available (2008-2012). The second slider button (All Topics / Group Topics) allows for grouping and regrouping of topics. The All Topics view shows one circle-shaped bubble chart. The Group Topics view divides the chart into three groups of bubbles. The first group presents topics that appear only in the EDM field. The third one shows topics related only to the LA field (i.e., LAK and JETS publications). The group in the middle shows mixed topics, i.e., those that appear at least once in both EDM and LAK/JETS. Different views of the bubble chart are presented in Figure 2.

The size of a bubble represents the topic's relevancy (i.e., its TF-IDF value). The two research fields, EDM and LA, are color-coded. Each bubble is divided into two slices, the sizes of which correspond to the frequency of that topic within the publications of each of the two sources. For the years 2008-2010, the dataset contains data only for the EDM research field, so the bubbles are one-colored. The order of the topic bubbles is intended to help users compare the two fields. The leftmost bubbles represent mostly EDM-related topics, while the rightmost bubbles mostly belong to the LA field. Moreover, clicking on a bubble creates a line chart in a sidebar. The line chart shows the growth and decline of a certain topic.

In addition to the visualization, the Paperista application allows users to browse papers by topic. When a user clicks on a particular bubble (topic), a list of papers related to that topic appears in the right sidebar (each represented by a title and a list of authors). Clicking on a particular paper opens a link to Google Scholar with the title of the article as the search query. Thus, if a paper is available online, a user can easily obtain it through the Paperista system.
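The grouping of topics and the two-slice split of a bubble can be sketched as below. This is an illustrative assumption about how such values might be derived, not Paperista's actual code; the function names and the per-topic paper counts are hypothetical.

```python
def group_topics(topic_counts):
    """Split topics into EDM-only, mixed, and LA-only groups."""
    groups = {"EDM": [], "Mixed": [], "LA": []}
    for topic, (edm, la) in sorted(topic_counts.items()):
        if edm and la:
            groups["Mixed"].append(topic)  # appears at least once in both fields
        elif edm:
            groups["EDM"].append(topic)
        elif la:
            groups["LA"].append(topic)
    return groups


def slice_shares(edm, la):
    """Share of a bubble given to each field, proportional to topic frequency."""
    total = edm + la
    return (edm / total, la / total) if total else (0.0, 0.0)


# Hypothetical (EDM papers, LAK/JETS papers) counts per topic.
topic_counts = {
    "Bayesian network": (12, 0),
    "Data mining": (30, 10),
    "Social network analysis": (0, 7),
}
groups = group_topics(topic_counts)
shares = slice_shares(*topic_counts["Data mining"])
```

A topic with counts in only one field yields a one-colored bubble, which matches the behavior described for the years 2008-2010.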

Figure 2 - Different views of the bubble chart: (1) All Years / All Topics (no highlights); (2) All Years / Group Topics (with highlights); (3) By Year (2009) / All Topics (no highlights); and (4) By Year (2012) / Group Topics (with highlights)

Furthermore, when a user hovers over the paper title, all topics related to that paper become highlighted. By hovering over papers, users can gain quick insight about topic connections between publications. Users can also distinguish papers annotated with highly relevant topics from those marked with insignificant ones. This can show which papers are more related to the fields of EDM and LA, and which can be viewed as outliers.

3.3 Paperista Architecture and Dataset API

The Paperista system consists of a Web application and a server application that provides a RESTful API for communicating with the dataset. The Web-based visualization is written in D3, a JavaScript library for manipulating documents based on data [3]. We chose D3 because of its good performance for animation and interaction within the Web environment. The visualization is available at the following address: www.uzrok.com/paperista. All data about conference topics and their significance (explained in Section 3.1) is available as a part of the Paperista Dataset API. This API supports a REST model for accessing the data and is available at: http://147.91.128.71:9090/LAKChallenge2013. The Paperista Web application calls these operations in order to access data from the dataset (for example, a click on a topic triggers a call to the API, which returns a list of papers).

4. DISCUSSION
When looking at the view displaying topic distribution in all years (Figure 2.1), one can observe that the EDM conference dominates in almost all topics. This is because the EDM conference has been organized for longer than the LAK conference (3 years longer), and thus the LAK dataset contains more papers overall from the EDM conference. Filtering topics by year allows for observing the popularity of topics in a particular year and a particular field (LA or EDM). This further enables one to observe the shift in researchers' interest in a particular topic in the LA and EDM fields throughout the years. For instance, one can observe that before 2011, the topic of Learning Analytics was not very popular in the papers from the EDM field; thus this topic is not displayed at all in the visualizations for the years 2008-2010. In 2011, it boomed in popularity, as indicated by the significant rise in the number of papers covering it. In fact, this was the first year the LAK conference was organized, and it immediately attracted the attention of researchers interested in the topic of Learning Analytics. Interestingly, this topic also gained some traction among the researchers publishing in the EDM field. In 2012, the topic's popularity grew further and the researchers covering it directed their effort toward the LA field. As a result, papers published within the LA field almost exclusively cover the topic of Learning Analytics. Similarly, we can observe topics that have kept high popularity in both areas over the years. For instance, this is the case with the Data topic, evidently as a consequence of research in both areas concentrating on the analysis of large amounts of data coming from various learning systems and other sources. The application also allows us to observe that topics such as Intelligent Tutoring System, Prediction, and Accuracy and Precision mostly kept their popularity throughout the years and stayed exclusively within the EDM field.
On the other hand, one can observe that the large majority of topics have been covered by both fields. This suggests that the similarities between the two fields are significant, as they share many research topics.

5. CONCLUSION
In this paper we have presented our approach to visualizing topics and their trends in the LA and EDM fields. Our application allows for easy identification of the main topics researchers in these fields have been focusing on, and also for exploration of papers related to those topics. Among similar tools that provide visualization of research topics, our tool is most similar to the previously mentioned Rexplore tool. However, while Rexplore is more focused on relations between authors and topics in research areas, Paperista's focus is on research topics and their trends over time. Also, Paperista allows for exploring papers related to different topics. Future work on Paperista will be primarily directed towards extending the system to support other datasets related to other research areas. Since the LAK dataset is RDF-based, Paperista can easily be expanded to support other RDF-based datasets expressed using the same or a related vocabulary, such as the Semantic Web Dog Food corpus¹². Regarding the interface, we plan to introduce keyword-based search functionality for finding a topic by its name. This would allow for easy navigation to a desired topic and filtering of papers related to it. The final goal for Paperista is to become a universal visualization tool for research papers.

6. REFERENCES
[1] Carter, S. Four Ways to Slice Obama's 2013 Budget Proposal. New York Times, 2012. Available online: http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html
[2] Bostock, M., Carter, S., and Ericson, M. At the National Conventions, the Words They Used. New York Times, 2012. Available online: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
[3] Bostock, M., Ogievetsky, V., and Heer, J. D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011. Available online: http://vis.stanford.edu/papers/d3
[4] Dubinko, M. et al. Visualizing Tags over Time. WWW 2006, Edinburgh. Available online: http://labs.rightnow.com/colloquium/papers/visualizing_tags.pdf
[5] Zhang, H., Korayem, M., You, E., and Crandall, D. J. Beyond Co-occurrence: Discovering and Visualizing Tag Relationships from Geo-spatial and Temporal Similarities. Available online: http://www.cs.indiana.edu/~zhanhaip/wsdm2012clustering.pdf
[6] Lemma, R. Visualizing the DBLP Database. Bachelor Thesis, 2010. Available online: http://www.inf.usi.ch/faculty/lanza/Downloads/Lemm2010a.pdf

¹² http://data.semanticweb.org

[7] Wattenberg, M. Arc Diagrams: Visualizing Structure in Strings. InfoVis 2002. Available online: http://hint.fm/papers/arc-diagrams.pdf
[8] Ponzetto, S. P., & Strube, M. (2007, July). Deriving a large scale taxonomy from Wikipedia. In Proceedings of the national conference on artificial intelligence (Vol. 22, No. 2, p. 1440). Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Available online: http://www.hits.org/english/research/nlp/papers/ponzetto07b.pdf
[9] Ferragina, P., & Scaiella, U. (2010, October). TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1625-1628). ACM.
[10] Mendes, P. N., Jakob, M., García-Silva, A., & Bizer, C. (2011, September). DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (pp. 1-8). ACM.
[11] Osborne, F., & Motta, E. (2012). Mining semantic relations between research areas. The Semantic Web - ISWC 2012, 410-426.
[12] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). DBpedia: A nucleus for a web of open data. The Semantic Web, 722-735.

[13] Milne, D., & Witten, I. H. (2008, October). Learning to link with Wikipedia. In Proceedings of the 17th ACM conference on Information and knowledge management (pp. 509-518). ACM.
[14] Siemens, G., & Long, P. (2011). Penetrating the Fog: Analytics in Learning and Education. Educause Review, 46(5), 30-32.
[15] Osborne, F., & Motta, E. (2012). Making Sense of Research with Rexplore. The Semantic Web - ISWC 2012.
[16] 1st International Conference on Learning Analytics and Knowledge, Banff, Alberta, February 27 - March 1, 2011. https://tekri.athabascau.ca/analytics/
[17] Romero, C., & Ventura, S. (2010). Educational data mining: a review of the state of the art. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 40(6), 601-618.
[18] Siemens, G., & Baker, R. S. D. (2012, April). Learning analytics and educational data mining: towards communication and collaboration. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge (pp. 252-254). ACM.

Analysis of the Community of Learning Analytics


Sadia Nawaz
Purdue University West Lafayette, IN, USA

Farshid Marbouti
Purdue University West Lafayette, IN, USA

Johannes Strobel Purdue University West Lafayette, IN, USA

sadia@alumni.purdue.edu

fmarbout@purdue.edu

jstrobel@purdue.edu

ABSTRACT

This paper presents trends in the learning analytics community in terms of authors, their affiliations and their geographical locations. It thereby brings out the most influential authors, institutes, and countries that have been actively contributing to this field. In addition, the paper identifies collaborations among authors, institutes, and countries, and explores the research themes followed by the learning analytics community.

1. DATA AND TOOLS

The data analyzed in this paper consists of the Learning Analytics and Knowledge (LAK) conference 2011-2012, the Educational Data Mining (EDM) conference 2008-2012 and the Journal of Educational Technology and Society (JETS) special edition on learning and knowledge analytics. This data was provided on the Society for Learning Analytics Research (SoLAR) website in XML format [1]. The XML data was converted to tabular data using an XML-to-CSV converter [2]. The converted CSV files were then processed and merged using macro programming in MS Excel. Later, this data was processed with NodeXL, an open-source template for Microsoft Excel [3]. It allows the user to work on different worksheets for different operations: for example, the Edges worksheet can be used to compute inter-/intra-collaboration, while the Vertices worksheet allows the display and computation of individual node properties such as degree, betweenness centrality, etc. Other tools that have been utilized in this paper include NetDraw [4] and IBM's Many Eyes [5].

2. MOTIVATION

With the increase of attention to the interdisciplinary field of Learning Analytics, scholars from different disciplines such as education, technology, and social sciences are contributing to this field [6]. Different authors with different backgrounds, expertise and purposes publish and present their work in Learning Analytics related journals and conferences. To gain a better understanding of who the top collaborators in the field are and which institutes and countries are more active in creating and disseminating knowledge, we analyzed the data described in the previous section.

3. AUTHORSHIP TRENDS
A complete summary of various author-related statistics is provided in Table 1 (detailed definitions of the graph theory terms are available at [7]). Analysis of authors provides information that not only helps in understanding the growth of the field (in terms of publication counts, author counts, etc.) but can also be used to anticipate the future of the field. For example, the number of connected components and the maximum number of edges in a connected component show that the graphs are becoming more populated and connected, implying a growing inclination towards collaboration. Overall, the field itself is growing, as is apparent from the node counts (2008-2012) and article counts (the sum of single- and multi-author article counts). Similarly, the self-loop count together with the single-vertex connected component count shows how many authors of single-authored publications have or have not collaborated (within this data). For example, the last column indicates that overall there have been 26 single-authored articles by 25 authors. Of these authors, 14 have had no collaborative work in this data. Stephen E. Fancsali is the only author with two single-authored publications.

Table 1: Combined statistics for EDM, LAK and JETS

| Graph metric (graph theory terminology) | 2008 | 2009 | 2010 | 2011 | 2012 | Total |
|---|---|---|---|---|---|---|
| Total unique vertices / nodes (authors) | 74 | 79 | 151 | 193 | 281 | 623 |
| Unique edges (an edge is a loop for single-author articles and a straight line otherwise) | 100 | 106 | 208 | 251 | 435 | 938 |
| Edges with duplicates (i.e., edge weight is greater than 1; these edges show joint authorship in more than one publication) | 17 | 18 | 50 | 42 | 48 | 337 |
| Total edges | 117 | 124 | 258 | 293 | 483 | 1275 |
| Self-loops (single-author articles) | 4 | 1 | 3 | 10 | 8 | 26 |
| Multi-author article count | 27 | 31 | 61 | 75 | 96 | 27 |
| Connected components (authors forming a cluster based on authorship) | 20 | 22 | 38 | 53 | 79 | 140 |
| Single-vertex connected components (count of the authors of single-author articles who did not collaborate) | 4 | 0 | 3 | 8 | 7 | 14 |
| Maximum vertices in a connected component | 15 | 7 | 15 | 29 | 22 | 113 |
| Maximum edges in a connected component | 33 | 16 | 36 | 72 | 76 | 370 |
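As an illustration of how the metrics in Table 1 can be derived from per-paper author lists, the following minimal Python sketch builds a co-authorship graph and counts vertices, edges, self-loops and connected components. It is not the NodeXL workflow used by the authors. It assumes that "unique edges" counts author pairs that occur exactly once and that "edges with duplicates" counts every occurrence of a pair that occurs more than once, which is consistent with 938 + 337 = 1275 in Table 1. The function names and the toy paper list are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers):
    """One edge per co-author pair per paper; a single-author paper is a self-loop."""
    edges = Counter()
    for authors in papers:
        authors = sorted(set(authors))
        if len(authors) == 1:
            edges[(authors[0], authors[0])] += 1
        else:
            edges.update(combinations(authors, 2))
    return edges


def connected_components(edges):
    """Group authors into components with an iterative depth-first search."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


def graph_metrics(papers):
    edges = coauthor_edges(papers)
    components = connected_components(edges)
    return {
        "vertices": len({a for pair in edges for a in pair}),
        # Assumed reading of Table 1: pairs seen once vs. all occurrences of repeated pairs.
        "unique_edges": sum(1 for w in edges.values() if w == 1),
        "edges_with_duplicates": sum(w for w in edges.values() if w > 1),
        "total_edges": sum(edges.values()),
        "self_loops": sum(w for (a, b), w in edges.items() if a == b),
        "connected_components": len(components),
        "single_vertex_components": sum(1 for c in components if len(c) == 1),
        "max_vertices_in_component": max(len(c) for c in components),
    }


# Hypothetical toy data: each inner list holds the authors of one paper.
toy_papers = [["A", "B"], ["A", "B"], ["B", "C"], ["D"], ["E", "F", "G"]]
metrics = graph_metrics(toy_papers)
print(metrics)
```

On the toy data, author D forms a single-vertex component (a single-author paper with no collaborations), and the pair A-B contributes two edge occurrences to the "edges with duplicates" count.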

4. COLLABORATION TRENDS
Collaboration, as defined in the Oxford dictionary [8], is "the action of working with someone to produce something"; in the current context it represents co-authorship of an article by two or more researchers. This term can be extended to institutes and even countries, and hence collaboration patterns will also be extracted between and within institutes and countries. Table 2 shows that there have been 938 pairs of authors who collaborated just once (this number includes single-author articles, since in that case a self-loop serves as an edge from an author to itself). Put differently, 73.57% of all edges (938 of 1275) belong to authors who have collaborated just once. This could mean either that new collaborations are forming, or that these authors published just once and then started working in other research areas, with other authors, or targeting other venues. Initiatives such as the LAK Data Challenge may therefore attract more researchers to this field and hence help in the further growth and development of authorship networks.

Table 2: Overall collaboration pattern

| Author pairs | Article count |
|---|---|
| 1 | 10 |
| 2 | 6 |
| 2 | 5 |
| 10 | 4 |
| 15 | 3 |
| 110 | 2 |
| 938 | 1 |

1(10) + 2(6) + 2(5) + 10(4) + 15(3) + 110(2) + 938(1) = 1275

Table 3 presents some of the top collaborators; e.g., N.T. Heffernan has been a co-author with J.E. Beck and with Z.A. Pardos in 6 articles each. Such analysis can help in finding active researchers and collaborators in this field.

Table 3: Top collaborators based on article count

| Author | Co-author(s) | Article count |
|---|---|---|
| S. Ventura | C. Romero | 10 |
| Neil T. Heffernan | Joseph E. Beck, Zachary A. Pardos | 6, 6 |
| Arnon Hershkovitz | Rafi Nachmias | 5 |
| Sujith M. Gowda | Ryan S. J. d. Baker | 5 |
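The arithmetic behind Table 2 can be checked with a few lines of Python. The sketch below only re-computes figures already given in Tables 1 and 2; the variable names are illustrative.

```python
# Table 2 as a mapping: joint article count -> number of author pairs with that count.
pairs_by_article_count = {10: 1, 6: 2, 5: 2, 4: 10, 3: 15, 2: 110, 1: 938}

# Every pair contributes one edge per joint article.
total_edges = sum(count * pairs for count, pairs in pairs_by_article_count.items())

# Pairs (and edge occurrences) that correspond to repeated collaboration.
repeated_pairs = sum(p for c, p in pairs_by_article_count.items() if c > 1)
repeated_edges = sum(c * p for c, p in pairs_by_article_count.items() if c > 1)

# Share of edges that belong to one-off collaborations, in percent.
share_once = round(100 * pairs_by_article_count[1] / total_edges, 2)

print(total_edges, repeated_pairs, repeated_edges, share_once)
```

The total of 1275 matches the "Total edges" row of Table 1, and the 337 edge occurrences from repeated collaborations match its "Edges with duplicates" row.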

Table 4: Top 10 authors with highest degree counts

| Author | Degree | Article count |
|---|---|---|
| Kenneth R. Koedinger | 34 | 17 |
| Ryan S. J. d. Baker | 25 | 11 |
| C. Romero | 19 | 11 |
| Vincent Aleven | 18 | 5 |
| S. Ventura | 17 | 11 |
| Neil T. Heffernan | 16 | 16 |
| Sujith M. Gowda | 15 | 5 |
| Mykola Pechenizkiy | 15 | 7 |
| Arthur C. Graesser | 14 | 4 |
| Jack Mostow | 13 | 12 |

6. GEOGRAPHICAL LOCATION
Next, the geographical analysis of this dataset is presented, which aims to explore the countries that have been extending this field, especially through contributions to the venues EDM, LAK and JETS. There have been contributions from 41 different countries. For extracting this information, all aliases of a country's name were merged, e.g., Netherland, Netherlands, The_Netherlands etc. were all merged together. The top countries that have had international collaborations are listed in Table 5. Clearly, the USA and the UK are at the top of the list. To illustrate the collaboration patterns between countries, Figure 1 was drawn using NetDraw. In this figure an edge between two countries depicts co-authorship between researchers from these countries. The edge width (also represented by a number) shows the strength of such collaboration. Also, different symbols have been used for different nodes based on their betweenness values. Betweenness centrality is the number of times a node acts as a bridge along the shortest path between two other nodes [9]. Clearly, the USA, the UK and Germany are at the top of this list based on degree and centrality measures. Most of the nodes have a betweenness value of zero, as depicted with a + symbol. This indicates the peripheral nature of these nodes and thus reflects the birth or growth of this field, in that newer nodes are being added and the graph is currently sparse. Figure 2 illustrates the geographical diversity of collaborators. Smaller circles show lesser diversity in terms of collaboration (with researchers from other countries). Similarly, larger circles indicate countries whose researchers have a more diverse group of co-authors (from across the world). In this figure a small table at the bottom shows the count of papers from each continent. It thus brings out the most active regions for research in the area of learning analytics.
Clearly, North America and Europe are at the top of this list (the complete geographical mapping is available at [5]).

Table 5: Top international collaborators

| Country | Degree |
|---|---|
| USA | 11 |
| UK | 10 |
| Australia, Germany | 6 |
| Netherlands | 5 |
| Canada, Belgium, Greece, Spain | 4 |
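To make the betweenness measure used above concrete, here is a minimal pure-Python sketch of Brandes' algorithm for an unweighted, undirected graph. It is offered only as an illustration of the definition in [9] and is not the NetDraw computation used for Figure 1. The country graph below is an invented toy example, not the collaboration network of this dataset.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm: number of shortest paths passing through each node."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                    # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair of endpoints is visited twice.
    return {v: b / 2 for v, b in score.items()}


def undirected(edge_list):
    """Build a symmetric adjacency list from a list of edges."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


# Invented toy collaboration graph (not taken from the dataset).
toy = undirected([("USA", "UK"), ("USA", "Canada"), ("USA", "Spain"), ("UK", "Greece")])
centrality = betweenness(toy)
print(centrality)
```

In the toy graph the leaf countries have a betweenness of zero, which corresponds to the peripheral nodes marked with a + symbol in Figure 1.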

5. DIVERSITY
Diversity in this context is the count of distinct researchers a given author has worked with. Table 4 identifies the contributors who have worked with the most diverse groups of authors; e.g., K.R. Koedinger has worked with 34 distinct authors and Ryan Baker has worked with 25 distinct authors. We also extracted the graph of these top contributors (based on degree), i.e., a graph which includes these top authors and all of their collaborators. This new graph consists of 128 authors (roughly 21% of the 623 authors in total). This percentage shows the significance of the top authors for EDM, LAK and JETS, and for learning analytics in general.
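The degree-based notion of diversity and the sub-graph of top contributors described above can be sketched as follows. This is an illustrative stand-in for the NodeXL-based analysis, with hypothetical function names and invented toy data.

```python
from collections import defaultdict


def coauthor_sets(papers):
    """Map each author to the set of distinct co-authors (the author's degree)."""
    partners = defaultdict(set)
    for authors in papers:
        for a in authors:
            partners[a].update(b for b in authors if b != a)
    return partners


def top_author_neighbourhood(papers, k):
    """Return the k highest-degree authors, the union of them and their co-authors,
    and the share of all authors covered by that union."""
    partners = coauthor_sets(papers)
    ranked = sorted(partners, key=lambda a: (-len(partners[a]), a))
    top = ranked[:k]
    members = set(top)
    for a in top:
        members |= partners[a]
    return top, members, len(members) / len(partners)


# Invented toy data: each inner list holds the authors of one paper.
toy_papers = [["A", "B", "C"], ["A", "D"], ["E", "F"], ["G"]]
top, members, share = top_author_neighbourhood(toy_papers, k=1)
print(top, sorted(members), round(share, 3))
```

Here author A has degree 3, and the neighbourhood of this single top author already covers 4 of the 7 authors in the toy data.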

Figure 1: Collaboration in terms of geographical location

Figure 2: Geographical diversity of collaborators

7. AUTHOR AFFILIATION
Next, the institutional affiliation of authors was analyzed, and it was found that there have been contributions from 200 different institutes worldwide. The ranking of the top few institutes in terms of collaboration with other institutes is provided in Table 6. The term degree represents the count of unique institutes that a given institute has worked with. This term can be influenced by both the article counts and the co-author counts. Table 7 lists the institutes with the highest counts of intra-institute collaboration and Table 8 lists the institute pairs that have had the highest collaboration. Such analysis is beneficial to research institutes and organizations so that they may collaborate and extend further studies in the field of learning analytics. Figure 3 illustrates trends of collaboration between institutes.

Table 6: Top institutes with highest counts of distinct collaborators

| Institute | Degree |
|---|---|
| Carnegie Mellon University | 20 |
| University of Cordoba | 9 |
| Stanford University | 8 |
| Fraunhofer Institute for Applied Information Technology | 7 |
| Dept. Computer wetenschappen, KU Leuven | 7 |
| Worcester Polytechnic Institute | 7 |
| Open University of the Netherlands | 6 |
| University of Pittsburgh | 6 |

Table 7: Top institutes with highest count of intra-institute collaboration

| Institute | Self-loop count |
|---|---|
| Worcester Polytechnic Institute | 116 |
| Carnegie Mellon University | 107 |
| Eindhoven University of Technology | 36 |
| University of Cordoba | 33 |
| University of Memphis | 31 |
| Universitat Oberta de Catalunya (UOC) | 31 |
| University of North Carolina at Charlotte | 20 |
| RWTH Aachen University | 16 |

(i.e., 2008-2009) this field is empty; similarly, some of the articles in later years had this field empty. Therefore, it was decided to use the title field for the purpose of keyword extraction. The selection of the title field rather than the abstract field for keyword extraction relies on an earlier study by the authors of this paper [10]. Next, the Hermetic Word Frequency Counter (HWFC) software [11] was used to extract the top 30 keywords for each year. Some common English words are already ignored by this software, as listed in its stop word list. Other words which are to be expected given the nature of the venues EDM, LAK and JETS were then manually eliminated (since they would not bring any insightful information to this analysis), e.g., student, learn, knowledge, education etc. Further refinement was made to merge varying instances of the same word, such as visual, visualize, visualization etc. Then, IBM's Many Eyes software utility was used to obtain the Matrix Chart provided in Figure 4. This figure presents the top 30 keywords for each year. It should be noted that, since the count of articles and venues has also increased over the years, the relative rank or position of keywords is discussed rather than absolute frequency counts. From this figure, it was found that the usage of some keywords, such as visualization, intelligent and network*, is increasing over time. Some keywords, such as model*, system* and tutor*, retain their ranks. The keywords online, collaborat*, performance etc. show fluctuating trends. Other trends can be interpreted similarly. The authors further extracted the context of these keywords: it was found that visualization co-occurs with data-mining, and intelligent appears with tutoring system. The word online has a broader class of co-occurring keywords, which includes learning, education, university, assessment systems, tutoring, courses, curriculum etc.
Interestingly, in 2012 the context changed to online communities, interactions and social learning etc. Due to space restrictions, further analysis cannot be provided in this paper.
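The title-based keyword analysis described above can be approximated with a short Python sketch. It is a plain stand-in for the HWFC software, not a reproduction of it. The stop word list, the domain word list, the prefix list used to merge word variants, and the sample titles are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical lists; the paper relies on HWFC's stop word list and manual elimination.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}
DOMAIN_WORDS = {"student", "students", "learn", "learning", "knowledge", "education"}
# Prefixes used to merge variants such as visual, visualize, visualization.
STEMS = ("visual", "model", "tutor", "system", "network", "collaborat")


def normalise(word):
    """Collapse word variants onto a shared 'stem*' label where a prefix matches."""
    for stem in STEMS:
        if word.startswith(stem):
            return stem + "*"
    return word


def top_keywords(titles, k=30):
    """Return the k most frequent keywords over a list of article titles."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word in STOP_WORDS or word in DOMAIN_WORDS:
                continue
            counts[normalise(word)] += 1
    return counts.most_common(k)


# Invented sample titles for a single year.
toy_titles = [
    "Visualizing student models",
    "A visualization of tutoring systems",
    "Modeling online learning",
]
keyword_counts = dict(top_keywords(toy_titles))
print(keyword_counts)
```

Running the function once per publication year and comparing the resulting ranks gives the kind of year-by-year matrix shown in Figure 4.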

CONCLUSION
In this paper, data from the past five years of publications related to learning analytics is analyzed. The trends show an increasing number of authors and more collaboration between authors as well as between institutes. Geographical analysis of authors shows that scholars from different countries have been collaborating and contributing to this field. Top authors, collaborators, and institutes are identified in this paper. The authors also attempted to bring out the research themes followed by the learning analytics community based on the frequency of keyword usage. The authors plan to extend this study based on authors' disciplinary diversity and on the association between authors and the research areas they have explored within learning analytics.

8. RESEARCH THEMES
In order to track the research themes being followed by the learning analytics community and to see their emergence over time, the authors conducted a keyword-based analysis. The information for this analysis was extracted from the keyword (subject) section of the data provided by the Society for Learning Analytics Research (SoLAR) website [1]. However, for the initial two years

Table 8: Top pairs for inter-institute collaboration

| Institute | Institute | Edge weight |
|---|---|---|
| Worcester Polytechnic Institute | Carnegie Mellon University | 37 |
| Claremont Graduate University | University of Memphis | 18 |
| University of Belgrade | Simon Fraser University | 9 |
| Northern Illinois University | University of Memphis | 9 |
| Hochschule für Wirtschaft und Recht | Hochschule für Technik und Wirtschaft | 8 |
| Beuth Hochschule für Technik Berlin | Hochschule für Technik und Wirtschaft | 8 |
| Universidade Federal de Alagoas | Carnegie Mellon University | 8 |
| Fraunhofer Institute for Applied Information Technology | Saarland University | 8 |

Figure 3: Trends of collaboration in terms of author affiliation

REFERENCES
[1] Taibi, D., Dietze, S. Fostering analytics on learning analytics research: the LAK dataset. Technical Report, 03/2013.
[2] LUXON SOFTWARE, 2013. Luxon software converter. http://www.luxonsoftware.com/converter/xmltocsv
[3] NODEXL, 2013. NodeXL. http://nodexl.codeplex.com/
[4] Borgatti, S.P., 2002. NetDraw Software for Network Visualization. Analytic Technologies: Lexington, KY.
[5] IBM, 2013. Many Eyes. http://www958.ibm.com/software/analytics/manyeyes/visualizations/analysis-of-the-community-of-learn
[6] Ferguson, R., 2012. The State of Learning Analytics in 2012: A Review and Future Challenges. Technical Report KMI 12-01, Knowledge Media Institute, The Open University, UK. http://kmi.open.ac.uk/publications/techreport/kmi-12-01
[7] YWORKS, 2013. yWorks developer's guide glossary. http://docs.yworks.com/yfiles/doc/developersguide/glossary.html
[8] OXFORD DICTIONARIES, 2013. Oxford dictionary: collaboration. http://oxforddictionaries.com/definition/english/collaboration
[9] WIKIPEDIA, 2013. Wikipedia: centrality. http://en.wikipedia.org/wiki/Betweenness#Betweenness_centrality
[10] Nawaz, S., Strobel, J., 2013. IEEE Transactions on Education authorship and content analysis, under preparation.
[11] HERMETIC, 2013. Hermetic Word Frequency Counter. http://www.hermetic.ch/wfc/wfc.htm

Figure 4: Keyword analysis for research theme extraction

Cite4Me: Semantic Retrieval and Analysis of Scientific Publications


Bernardo Pereira Nunes
bnunes@inf.puc-rio.br

ABSTRACT
This paper presents the Cite4Me Web application and its features created for the LAK Challenge 2013. The Web application focuses on two main directions: (i) interlinking of the LAK dataset with related data sources from the Linked Open Data cloud; and (ii) providing innovative search, visualization, retrieval and recommendation of scientific publications from the LAK dataset and related interlinked resources. Our approach is based on semantic and co-occurrence relations to provide new browsing experiences to Web users and an overview of the scientific data available. Furthermore, we present a detailed analysis of the LAK dataset along with applications which contribute to the development of the learning analytics field.

Besnik Fetahu
L3S Research Center Appelstrasse 9a Hannover, Germany

Marco Antonio Casanova casanova@inf.puc-rio.br


PUC-Rio Rio de Janeiro, Brazil

PUC-Rio Rio de Janeiro, Brazil

fetahu@l3s.de
1. INTRODUCTION

The volume of information on the Web has been growing steadily over the last decade and has doubled every two years. The vast amount of data available on the Web, along with new means of communication, has transformed our society, including the way we work, live, relate to each other and learn. In the midst of this change, Learning Analytics emerges to make sense of the educational data produced and reported by learners, professors, institutions and so on. Analyzing and understanding the changes over the past years helps us to understand the current state and to be aware of forthcoming trends, enabling a new outlook on the future of learning.

A recent challenge initiative of SOLAR¹ and the LinkedUp² project aims to foster the creation of tools that enable the analysis, visualization, browsing and recommendation of scientific and educational data. Although the scientific field has fostered the creation of new applications in several areas, such as medicine, biology and physics, amongst others, information access is based mostly on free text search and on the hierarchical classification system of the publications³. Moreover, current approaches by the main digital library providers, such as the ACM Digital Library⁴ and Elsevier⁵, do not represent the current state of research on exploring resources using approaches from Information Retrieval, Information Extraction and the Semantic Web. Thus, getting an overview of research topics, finding publications and discovering new nomenclatures are arduous and laborious tasks that are not always successful.

In this paper, we introduce Cite4Me, a novel application for exploratory search, retrieval and visualization of scientific publications. Cite4Me intends to provide end users with a single point for accessing papers, hence reducing the effort of searching in several data sources. Our system takes advantage of reference datasets, such as DBpedia⁶, to explore semantic relationships between scientific papers and user queries. Additionally, an analysis of topic coverage and shared concepts from related educational datasets, extracted from the Linked Open Data cloud, will be introduced.

The remainder of the paper is organized as follows. Section 2 presents the approach used for searching, retrieving and recommending papers. Section 3 describes the process of dataset discovery and interlinking, and Section 4 shows a brief result analysis of the data discovery. Finally, Section 5 presents related work and Section 6 presents some concluding remarks.

¹ Society for Learning Analytics Research - http://www.solaresearch.org
² http://linkedup-project.eu/

2. CITE4ME

As one of the main goals of the field of Learning Analytics is to support students in their learning process, we developed a Web application called Cite4Me⁷ that assists students in finding scientific publications and identifying relevant research topics. Cite4Me implements semantic and co-occurrence methods to (a) search and retrieve scientific publications; and (b) recommend scientific publications. Moreover, it provides a Web interface that facilitates the search for publications and may help users in discovering terms related to a given query. In this section, we provide an overview of the major features of the Web application and its Web interface that assist users in exploring scientific data on the Web.

2.1 Search and Retrieval

Cite4Me relies on search functionalities to meet the users' needs. Briefly, we implemented standard Information Retrieval (IR) and Semantic Web (SW) approaches to retrieve and recommend scientific papers to the users. We divided this subsection into (i) free text search; (ii) exploratory search; and (iii) semantic search.

Copyright © 2013 by the paper's authors. Copying permitted only for private and academic purposes. LAK-Data Challenge '13, Leuven, Belgium

³ http://www.acm.org/about/class/
⁴ http://dl.acm.org
⁵ http://www.elsevier.com
⁶ http://dbpedia.org
⁷ http://www.cite4me.com/

2.1.1 Free Text Search

The purpose of the free text search functionality is to offer users the ability to search for mentions, titles and authors of academic publications contained in the LAK dataset. Even though this functionality is similar to that of existing digital libraries, we consider it a basic functionality that must be provided by our application. Therefore, we use a standard vector space model (tf-idf) for indexing and retrieving documents. The tf-idf scores were computed for each term extracted from the publication content after applying stemming [14]. Furthermore, the search functionality offers boolean queries with standard operators, such as OR and AND, and also a ranking of the matching publications based on the sum of the tf-idf scores of the individual query terms. In summary, our free text search provides users with publications $P$ that match the query terms and with non-matching publications $P'$, which are related to $P$ according to a degree of similarity (see Eq. 1) but do not contain the query terms. The similarity between a matching publication $P$ and another, non-matching publication $P'$ in the LAK dataset is measured by the standard cosine similarity measure, which is built on top of the computed tf-idf scores.

$$\mathrm{Sim}(P, P') = \frac{P \cdot P'}{|P|\,|P'|} \qquad (1)$$
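The cosine similarity of Eq. 1 over tf-idf vectors can be sketched in a few lines of Python. This is a minimal illustration, not the Cite4Me implementation: it assumes documents are already tokenized, omits stemming, and uses one common tf-idf variant (raw term frequency times log(N/df)), which the paper does not specify. The sample documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict term -> weight) per tokenized document.
    Assumed weighting: raw term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(p, q):
    """Cosine similarity between two sparse vectors, as in Eq. 1."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Invented, already tokenized sample publications.
docs = [
    ["learning", "analytics", "dashboard"],
    ["learning", "analytics", "visualization"],
    ["bayesian", "knowledge", "tracing"],
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(round(sim_related, 3), sim_unrelated)
```

In this setting, a publication that does not contain the query terms can still be surfaced if its vector is close to that of a matching publication, which is the role of the non-matching publications described above.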

for the entities contained in a publication. Finally, the ranking of the results is based on the sum of the tf-idf scores of the matching concepts. Figure 2 illustrates the semantic search functionality. It also generates a tag cloud from the matching publications, showing the most prominent terms for a given query. Specifically, the tag cloud based on the results helps users gain insight into the topics and may assist them in finding related terms previously unknown to them.

2.2 Paper recommendation

Another key feature of our system is paper recommendation based on semantic relationships extracted from reference datasets. The recommendation is based on previous work [12, 11], where we exploit the number of paths and the distance (length of a path) between given entities to compute a relatedness score between extracted entities and associated documents. The first step in measuring the relatedness between documents is to compute the semantic connectivity score ($SCS_e$) of the entities found in each text (see Eq. 2).

\[ SCS_e(a, b) = \sum_{l=1}^{\tau} \beta^{l} \, \left| paths^{<l>}_{(a,b)} \right| \qquad (2) \]

where, in Eq. 1, P and P' represent the tf-idf scores for the terms in two distinct publications.

where $|paths^{<l>}_{(a,b)}|$ is the number of paths of length $l$ between $a$ and $b$, $0 < \beta \le 1$ is a positive damping factor and $\tau$ is the maximum path length considered. As in [12, 11], we used $\beta = 0.5$ as our damping factor. Furthermore, we constrained the length of a path to $\tau = 4$. Based on the score for entities, we then define the semantic connectivity score ($SCS_w$) between two documents $W_1$ and $W_2$ as follows:

2.1.2

Exploratory Search

In this section, we provide detailed insights into the exploratory search functionality of our application. As a preliminary step to provide analytics and information about the actual content and topic coverage, all the scientific publications contained in the LAK dataset are enriched beforehand. The enrichment process was performed using the DBpedia Spotlight API8, which extracts entities, entity types and their respective categories. After the enrichment process, we cluster the publications according to the entities and categories found in each document. The publications are clustered in a tree-based structure over the enrichments. Note that each node of the tree represents a topic covered by the publications under that node. Thus, the exploratory search is performed through the topics covered by each publication. The process of linking publications, categories and extra resources is mediated by the DBpedia knowledge graph, where we use the dcterms:subject property to match the resources. As a result, the exploratory search provides a way to explore resources through the connections between their topics, which facilitates the search for topically related resources. Figure 1 shows the exploratory search.
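The grouping of publications under topic nodes can be pictured with the following minimal sketch. It assumes a two-level tree (category, then entity, then publications) and takes the entity-to-category mapping as given; in the application this mapping would come from dcterms:subject links in DBpedia. The names `build_topic_tree` and `publications_under` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_topic_tree(annotations, entity_categories):
    """Group publications as category -> entity -> set of publication ids.

    annotations: {publication id: iterable of entities found in it}
    entity_categories: {entity: iterable of categories}, standing in for
        the dcterms:subject links of the knowledge graph.
    """
    tree = defaultdict(lambda: defaultdict(set))
    for pub, entities in annotations.items():
        for entity in entities:
            for category in entity_categories.get(entity, ()):
                tree[category][entity].add(pub)
    return tree


def publications_under(tree, category):
    """Return all publications reachable below a category node."""
    pubs = set()
    for entity_pubs in tree.get(category, {}).values():
        pubs |= entity_pubs
    return pubs


# Invented sample enrichments.
ANNOTATIONS = {
    "paper-1": ["Data_mining", "Student"],
    "paper-2": ["Data_mining", "Prediction"],
    "paper-3": ["Prediction"],
}
CATEGORIES = {
    "Data_mining": ["Data_analysis", "Formal_sciences"],
    "Prediction": ["Data_analysis"],
    "Student": ["Education"],
}

tree = build_topic_tree(ANNOTATIONS, CATEGORIES)
print(sorted(publications_under(tree, "Data_analysis")))
```

Navigating from a category node to its publications, and from there to the other categories of those publications, gives the topic-to-topic exploration described above.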

\[ SCS_w(W_1, W_2) = \left( \sum_{\substack{e_1 \in E_1,\ e_2 \in E_2 \\ e_1 \neq e_2}} SCS_e(e_1, e_2) + |E_1 \cap E_2| \right) \cdot \frac{1}{|E_1|\,|E_2|} \qquad (3) \]

where $E_i$ is the set of entities associated with $W_i$, for $i = 1, 2$. Note that documents that contain the same entities receive an extra bonus (the second term on the right-hand side of Eq. 3). Thus, a list of document pairs is generated, ranked according to the score and suggested to the user. Figure 3 illustrates the paper recommendation process computed based on $SCS_w$.
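To make the two scores concrete, the sketch below computes the entity-level score as a damped sum of path counts and the document-level score as the normalised sum over entity pairs plus the shared-entity bonus. It rests on several assumptions: the reference dataset is modelled as a small undirected adjacency dictionary, only simple paths (no repeated nodes) are counted, and the function names and toy graph are hypothetical. The defaults follow the values stated in the text (damping factor 0.5, maximum path length 4).

```python
def count_paths(graph, a, b, max_len):
    """Return {length: number of simple paths from a to b} for 1..max_len."""
    counts = {length: 0 for length in range(1, max_len + 1)}

    def walk(node, visited, length):
        if length == max_len:
            return
        for nxt in graph.get(node, ()):
            if nxt == b:
                counts[length + 1] += 1
            elif nxt not in visited:
                walk(nxt, visited | {nxt}, length + 1)

    walk(a, {a}, 0)
    return counts


def scs_e(graph, a, b, beta=0.5, tau=4):
    """Entity-level score: path counts damped by beta ** length (Eq. 2)."""
    return sum(beta ** l * n for l, n in count_paths(graph, a, b, tau).items())


def scs_w(graph, e1_set, e2_set, beta=0.5, tau=4):
    """Document-level score with the shared-entity bonus (Eq. 3)."""
    if not e1_set or not e2_set:
        return 0.0
    total = sum(
        scs_e(graph, e1, e2, beta, tau)
        for e1 in e1_set
        for e2 in e2_set
        if e1 != e2
    )
    return (total + len(e1_set & e2_set)) / (len(e1_set) * len(e2_set))


# Toy undirected graph between entities (invented for illustration).
GRAPH = {
    "Data_mining": {"Statistics", "Algorithm"},
    "Statistics": {"Data_mining", "Prediction"},
    "Algorithm": {"Data_mining", "Prediction"},
    "Prediction": {"Statistics", "Algorithm"},
}

print(scs_e(GRAPH, "Data_mining", "Prediction"))
print(scs_w(GRAPH, {"Data_mining"}, {"Data_mining", "Prediction"}))
```

In the toy graph, Data_mining and Prediction are connected by two paths of length 2, so their entity score is 2 × 0.5² = 0.5. The document score for the two example entity sets is (0.5 + 1) / 2 = 0.75, where the 1 is the bonus for the shared entity.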

3.

DATASET DISCOVERY AND INTERLINKING

This section briefly describes the datasets used for the automatic discovery of related data from DataHub9 and the future steps on dataset discovery and interlinking.

3.1

LAK Dataset

2.1.3

Semantic Search

Cite4Me also provides a semantic search engine that helps users find publications semantically related to the query terms. Analogously to the explicit semantic analysis (ESA) technique [5], the relatedness score is computed between the enriched concepts found in the content of the publications. Basically, the semantic search is an adaptation of the free text search presented in Section 2.1.1. Instead of computing the tf-idf scores for the words in a text, it computes the tf-idf score

The LAK dataset contains the metadata of the papers published in the proceedings of the LAK conference (2011-12), the special issue on Learning and Knowledge Analytics of Educational Technology & Society, the proceedings of the International Conference on Educational Data Mining (2008-12) and the Journal of Educational Data Mining (2008-12). In total, 315 paper descriptions were available, containing detailed information about authors, institutions, conference venues and the full content of each paper.

3.2

Data Analysis

The goal of the data analysis procedure is to align the various publications in the LAK dataset based on mutual information, such

9 http://www.datahub.io

8 http://dbpedia.org/spotlight

Figure 1: Preview of the exploratory search functionality.

Figure 2: Preview of the semantic search functionality.

Figure 3: An example of paper recommendation based on $SCS_w$.

4.

EVALUATION OF DATA ANALYSIS AND DATA DISCOVERY

This section presents an overview of the results obtained by analyzing the LAK dataset with respect to the constructed feature set that describes the topics covered by individual publications. Moreover, based on the data analysis procedure and shared information, we show that it is possible to establish links between the different publications within the LAK dataset and to other datasets in DataHub. In the following subsections, we present the analysis of the LAK dataset and the discovery of relevant datasets and publications.

4.1

Data Analysis

Figure 4: Relevant Dataset Discovery Framework based on the generated feature set used to query DataHub Linked Data provider.

as the topics covered by them. This is achieved using well-established datasets like DBpedia10 and Freebase11, where a reference point for the unstructured textual content of publications is created through an enrichment process. Again, the enrichment process is carried out using DBpedia Spotlight12 [10] and addresses several issues of significant importance. For instance, it offers several advantages such as: (i) identification of (common) named entities; (ii) disambiguation; and (iii) expansion of the limited dataset and resource descriptions with additional background knowledge.

The data analysis of the LAK dataset focuses mostly on assessing the individual publications for their topic coverage. In this manner, we build a connected data graph consisting of the individual publications and items from the feature set. This step is necessary to provide the exploratory search functionality, where based on the established edges between publications and feature set items, we can navigate through the publications or topics of interest. Therefore, the results obtained with respect to the constructed feature set and LAK dataset graph are shown in what follows. Table 1 shows the top ranked items for each of the feature sets, along with the number of associations an item has with respect to all publications (entity, category and type items). Figure 5 shows the constructed data graph for the LAK dataset.

4.2

Data Discovery

3.3

Data Discovery

Our Web application uses the instances in the LAK dataset as its starting point to automatically explore and recommend to users datasets that cover similar topics. In order to query, detect and interlink related datasets, we chose DataHub as a data provider. DataHub serves as a collecting point for datasets from various fields and currently holds over 5000 datasets. Note that, of this large number of datasets, only 300 are provided as Linked Open Data. As the latter is the main focus of our work, the analysis and interlinking process concentrates on these datasets. Briefly, the data discovery is performed using the CKAN13 data management framework from DataHub, where related datasets are suggested based on the data analysis and on user interests (such as the topics covered by a publication or resource). Additionally, we provide the user with a set of resources, amongst other data analytics, that enables the user to harvest and correlate new information from the discovered resources, considering the LAK dataset as the starting point of such discovery. This approach benefits from the adoption and widespread use of Linked Data principles for publishing scientific papers. Nowadays, many conferences make their proceedings and journals freely accessible, hence our approach can take advantage of such open data and offer users topically relevant papers for a particular resource in the LAK dataset.
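A topic-based lookup against a CKAN catalogue could look like the sketch below, which uses the `package_search` action of the CKAN Action API. The base URL is a placeholder, not the actual DataHub endpoint, and the helper names are hypothetical. The text does not state how Linked Open Data datasets are identified in the catalogue, so the optional `fq` filter is left to the caller. The example builds the request URL and parses a canned response instead of performing a network call.

```python
from urllib.parse import urlencode

# Placeholder base URL (assumption); replace with the catalogue actually used.
CKAN_BASE = "https://datahub.example.org"


def build_search_url(topic, rows=10, fq=None, base=CKAN_BASE):
    """Build a CKAN package_search URL for a topic taken from the feature set."""
    params = {"q": topic, "rows": rows}
    if fq is not None:
        # Optional filter query, e.g. to restrict results to a tag or group.
        params["fq"] = fq
    return f"{base}/api/3/action/package_search?{urlencode(params)}"


def extract_datasets(response):
    """Return (name, title) pairs from a decoded package_search response."""
    if not response.get("success"):
        return []
    return [
        (pkg["name"], pkg.get("title", ""))
        for pkg in response["result"]["results"]
    ]


# Canned response in the shape returned by package_search (invented content).
SAMPLE_RESPONSE = {
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "b3kat", "title": "Bayerische Staatsbibliothek"}],
    },
}

print(build_search_url("Data mining", rows=5))
print(extract_datasets(SAMPLE_RESPONSE))
```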

After creating the feature set based on the information provided by the reference datasets, we are able to query for relevant datasets in DataHub. Thus, the data discovery for relevant resources is carried out for the top ranked feature set items. Table 2 shows the discovered resources and datasets for the top-10 entity items. Note that we focus only on bibliographic datasets, since we aim at recommending topically related scientific publications. Due to the lack of bibliographic datasets, we were not able to find related publications for all entities considered. The dataset names are represented by their acronyms as follows: b3kat ("Bayerische Staatsbibliothek"), hebis ("Hessisches Bibliotheks Informations System") and npg ("Nature Publishing Group - ALL"). Additionally, of the set of 96 bibliographic datasets available, only a few were offered as Linked Data, thus narrowing our search space for relevant resources.

Entity         b3kat   hebis   npg
Data              14       0    12
Learning           5       0     1
Data mining        0       0     0
Algorithm          4       0     0
Education         17       1     2
Analysis          42       1     6
Student            7       1     1
Knowledge         11       0     0
Methodology        4       0     1
Statistics         7       0     1

10 http://dbpedia.org
11 http://www.freebase.com/
12 http://spotlight.dbpedia.org/
13 http://www.ckan.org

Table 2: Number of discovered resources from the bibliographic group for the top ranked items from the entity feature set, based on the LAK dataset.

Entity                        Assoc.
Data                              90
Learning                          80
Data_mining                       67
Algorithm                         50
Education                         49
Analysis                          48
Student                           46
Knowledge                         46
Methodology                       42
Statistics                        41
System                            37
Scientific_modelling              37
Prediction                        36
Data_set                          36
Statistical_classification        30
Evaluation                        29
Standard_deviation                29
Probability                       28
Behavior                          26
Interaction                       24

Category                          Assoc.
Educational_psychology               161
Data_analysis                        150
Learning                             139
Scientific_method                    137
Neuropsychological_assessment        136
Greek_loanwords                      135
Data                                 131
Evaluation_methods                   129
Computer_data                        126
Research_methods                     124
Systems_science                      118
Formal_sciences                      108
Data_management                      108
Cognitive_science                    107
Statistical_terminology              107
Developmental_psychology             101
Intelligence                          93
Data_mining                           91
Critical_thinking                     87
Thought                               84

Type                                          Assoc.
DBpedia:TopicalConcept                           150
Freebase:/book                                   142
Freebase:/book/book_subject                      142
Freebase:/media_common                           138
Freebase:/media_common/quotation_subject         136
Freebase:/computer                               125
Freebase:/education                              122
Freebase:/education/field_of_study               120
Freebase:/computer/software_genre                120
Freebase:/internet                               118
Freebase:/internet/website_category              118
Freebase:/award                                  114
Freebase:/media_common/media_genre               105
Freebase:/organization                           103
Freebase:/award/award_discipline                 103
Freebase:/business                               102
Freebase:/organization/organization_sector        99
Freebase:/people                                  99
Freebase:/film                                    94
Freebase:/book/periodical_subject                 93

Table 1: Top ranked items from the feature set for the LAK Dataset, from the dataset analysis.
[Figure 5: constructed data graph for the LAK dataset. Node labels (DBpedia categories and truncated http://data.linkeded... resource URIs) are not reproduced here.]
Community websites Blog software Symbiosis Computer graphics Home computer software Companies establishe... Technical communicat... Mac OS X word proces... Industrial engineering Notetaking Stable distributions Dimension Animals described in... Curves Atari ST software Estimation of densit... Privately held compa... Bayesian networks RSS servers Collective intellige... Intellectual propert... 1969 introductions Open methodologies Decision trees Global internet Quality Proxy Quantitative research Mac OS word processors Marketing research c... comm... Mathematical sciences Holism Machine learning Computer security so... Scottish inventions Internet protocols Lifelong learning Invasive mammalStatistics spec... articles ... Social networking se... Computer security Classification algor... Deviance and pr... social ... Computer law 2000s in computer sc... Laptops Geography http://data.linkeded... Automatic identifica... Digital media Companies based in R... Data mining Social media Logarithms Articles with exampl... Communication design Cloud applications Representation theor... Sociology Buzzwords Experimental physics Environmental health Relational database ... BlackBerry software Crime prevention Rights Data modeling Charts InternetWeb 2.0 http://data.linkeded... http://data.linkeded... World Wide Web Representation theory Internet forum termi... Internet ages Theoretical computer... Persistent Worlds Prospect http://data.linkeded... theory Neologisms Web syndication form... Digital technology http://data.linkeded... Internet architecture Learning in computer... http://data.linkeded... Windows software Bilinear operators Enterprise applicati... New media http://data.linkeded... Internet memes Identifiers Algorithms Qualia Sociocultural global... Brand management Documentary film tec... Articles with exampl... Statistical classifi... Speech recognition http://data.linkeded... Sound Microsoft Office Science-related lists Sparse matrices Probability interpre... 
Service-oriented (bu... platforms Digital photography Control flow Metadata Computing Networks http://data.linkeded... http://data.linkeded... Analytical chemistry Waves Radiation health eff... Information technolo... Assistive technology Euclidean solid geom... Hearing Statistical charts a... Mutation Elementary shapes History of radio All articles lacking... Computability theory http://data.linkeded... Television genres Utility software type Websites http://data.linkeded... Computational resour... SQL Elementary geometry One Articles lacking sou... Complex numbers Computer science Articles including r... Metric geometry http://data.linkeded... Database theory Article Feedback 5 Web development Declarative programm... Articles with exampl... Applied mathematics Local government in ... Local government dis... http://data.linkeded... Web applications http://data.linkeded... Computational science Philosophy disambigu... Disk file systems Places in Berkshire ... Economics of transpo... Relational model Populated places est... Integrals History of telecommu... Modernism neuros... Computational Software companies b... Computer memory String similarity me... Area Constraint programming Economic geography Historical eras Query languages Towns in Berkshire Sociocultural evolut... Input/output Reporting Cloud computing States of the United... Computers Northern American co... Companies establishe... Electronic circuit v... DOS on authorities IBM PC compat... Linear operators in ... Computer languages Unitary ... Epidemiology Local authorities ad... Multivariate statist... Household Radio formats Member states ofincome NATO Theories of history Globalization Accesshttp://data.linkeded... control Database management ... Markup languages Subdivisions of the Cloud platforms Countries bordering ...... Logic in computer sc... Online education Countries bordering ... http://data.linkeded... IBM PC compatibles English-speaking cou... Kennet and Avon Canal First-level administ... 
Cultural geography Geographic informati... Income in the United... Website management Former confederations History of television United Particle physics States Visualization (graph... Superpowers Identity management Country subdivisions... Earth sciences data ... School districts... Internet search algo... Combinatorial optimi... States and territori... Data modeling langua... 1776 establishments Justification G8 nations Countries bordering ... http://data.linkeded... Logging Search engine optimi... Population density Google Income countr... http://data.linkeded... Bicontinental Library cataloging a... Thermodynamic entropy Computational comple... Algebraic structures Analytic geometry Link analysis Programming language... Internet companies o... Pharmaceutical indus... Data types Databases Secondary education Internet properties ... Philosophy of therma... Real-time web Combinatorics Statistical natural ... http://data.linkeded... Privately held compa... Cartography Geometric measurement Twitter Automotive transmiss... Automobile transmiss... Mechanical power con... Engagement http://data.linkeded... Knowledge bases Classical logic 19th-century mathema... 1856 births Saint Petersburg Sta... Russian statisticians Political theories 20th-century Nationalism mathema... People from Ryazan Sovereignty 1922 deaths Sustainability Full Members of war the ... Boolean algebra Aftermath of Grief Logical calculi Three-digit telephon... Members of the ... Red Hat Free Full package managem... Linux package manage... Distance Probability theorists Open University Health sciences education i... Russian mathematicians Health care Aid Inductive reasoning 1981 introductions Educational institut... ... 1969 establishments Emergency telephone Former Eastern Ortho... Primary care ... Exempt Medicine charities introductions Distance education i... MIPS 1968 Technologies of Commo... Archive formats Association Public services Charities based in B... 
Nursing International Charities based in S... Support develo... groups

Higher education in ... Higher education in ... Architectural commun...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

Frequency domain ana... Electrical circuits

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded... http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...
Integers

http://data.linkeded...

http://data.linkeded... http://data.linkeded...

Cardinal numbers Zonohedra Prismatoid polyhedra Space-filling polyhe... Cubes Platonic solids Volume

Number theory

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

http://data.linkeded...

Computer networking Youth rights Telecommunications e... Types of communities

2006 establishments ... Companies based in S... Free educational sof... Online chat Free learning suppor... Internet culture Free learning manage... Free Free content managem... software progra... Internet forums

Open problems Emerging technologies

English Heritage Listed buildings in Archaeology of... the U... Applied psychology British architecture Town and country pla... Aesthetics Consensus reality Educational research Text messaging Neural networks Companies listed on ... Technology in society Internet search engi... History of the Inter... Computer jargon Blogging Words coined in the ... Literary genres Politics and technol... Blogs Australian televisio... Ethernet English-language tel... Kohlberg Kravis Robe... Seven Network Networking hardware Television channels ... Internet terminology Network HTTP protocols

Systemic Risk - Beha... Collaborative software Subroutines Recommender Application layer pr... systemshttp://data.linkeded... Free software Multiple choice Programming constructs University of Cambri... Computational lingui... GraphArtificial theory intellige... http://data.linkeded... Files World Wide Web Conso... Computer file systems Clinical research http://data.linkeded... Inter-process commun... http://data.linkeded... Vision Photography Non-parametric stati... Computer security

http://data.linkeded...
Symmetry

Spam filtering

Markov processes Experiments Stochastic processes Chemical kinetics Environmental moveme... Coordinate systems Ren%C3%A9 Descartes Lie algebras Lie groups Geometric topology Nursing Food and Drugresearch Admini... Drug discovery Clinical trials South Asian countries Article Feedback Bla... India Member states of the... States and territori... G15 nations Member states of the... Countries the Ind... BRICS Former British colon...

http://data.linkeded... Graphic design

Cluster analysis Geostatistics Geodesy


Robust statistics

Maps

Exploratory data ana...

NP-complete problems Graph families Perfect graphs connectivity 1971 in Graph computer sci... Differential topology Complexity classes Parity Graph data structures Differential geometry Matrix normal forms Singular value decom... Glossaries of mathem... Manifolds Matrix theory Algebraic graph theory

Survey methodology Types of polling French loanwords Graphical user Android inter... IOS software software Computer file formats Software licenses Building engineering Advertising publicat... Server hardware Servers (computing) Java platform software http://data.linkeded... Statistical programm... Statistical software

System administration Services management ... Software industry Variable (computer p... GUI game widgets Video gameplay Business models Statistical outliers user inter... Software distribution Graphical Video game terminology Computer networks

http://data.linkeded... http://data.linkeded...
1985 software

Member states of the... G20 nations Federal countries Liberal democracies

Set families

Software engineering Free algebraic struc... Geometric group theory Combinatorial ... Propertiesgroup of groups

International rankings Systems Modeling Lan... Matrix decompositions Unified Modeling Lan... Open formats

Web browsers

http://data.linkeded... Humans
Asymptotic analysis Analysis of algorithms Trees (data structur... Data serialization f...

Software development... http://data.linkeded... Private equity portf... Software Project development... management s...

Fundamental analysis Stock market Foreign exchange mar... Derivatives (finance) Tasks of Natural lan... 2002 software .NET framework Commodities market Navigation Models oftr... computation Computer-assisted Machine translation Concurrency (compute... Formal specification... Petri nets applicatio... Microsoft GPS

Software using the M... Freedesktop.org Application programm... Operating systems Free graphics software Free windowing systems X Window System

Mac OS X software Distributed computin...

Thermodynamics Non-standard Binary arithmetic analysis Rotation Model selection Oracle acquisitions Free software compan... Surfaces Pi Google Earth of mathematics Software that uses Qt 2005 software Angle Lexical units Units History of information Words Linux software variable Field theory Regression ... Orientation Geometric centers Temperature Astrological aspects companies b... Binary treesSoftware Triangle centers Mathematics of infin... Keyhole Markup Langu... Remote sensing GIS file formats Bilinear forms Open Travel Alliance Differential geometr... Units of linguistic ... BigTable implementat... Companies establishe... Nucleic acids Freeware Primitive Circles types Affine geometry Virtual globes History of calculusmechanics Classical Orbits Celestial mechanics Kinematics Conic sections Euclidean geometry Companies based in S...

http://data.linkeded...
Spreadsheet software Mac OS software Inventory Signs of death Enterprise modelling Composting Business processwaste Biodegradable Anaerobic digestion Articles containing ... ... Discrete mathematics Network architecture Spreadsheet file file for... Bibliography fo... Presentation layer p... XML Computer hardware co... Cloud computing prov...o... Computer companies

Perimeter security Telecommunication th...

Hydrology Water waves Sociological theory Water streams Weather hazards Historiography Floodlandforms Weather Fluvial Geomorphology Rivers Water Postmodern theory Basic meteorological...

1969 in computer Directed graphs sci...

http://data.linkeded...

Windows administration Children Olympic sports Safety codes Stairways Childhood Mac OS user interface Architectural elements Figure Ice skating Sports entertainment Garden features dancing

Composite data types IBM Collier Trophy recip... 1975 establishments ... National Medal of Te... Electronics companie... Semiconductor compan... 1896 establishments ... Companies establishe... UML Partners American brands Information theory Computer storage com... Display technology c... Social classes Companies establishe... Socialism Abstract compan... data types Computer security Companies listed on ... so... Point of sale compan... Multinational Publicly traded comp... 1911 establishments ...W... Microsoft Software companies b... Companies based in Social divisions Companies establishe... Companies based in R... Transaction processing Dow Jones Industrial... Companies in the Dow...

Blogospheres

Solid mechanics Property se... Hacking (computer Web security exploits Deformation Injection exploits Vitaceae Property law Computer Viticulture security ex... Grape varieties Social inequality Software testing Security compliance

Wood Alumni of Woodworking Keele Univ... People from Stoke-on... The Pogues members 1955 births English guitarists English banjoists People Living associated wi... people Woodcarving

Figure 5: Topic coverage of LAK data graph for the individual resources.

5. RELATED WORK

Cobo et al. [3] present an analysis of student participation in online discussion forums using an agglomerative hierarchical clustering algorithm, and explore the profiles to find relevant activity patterns and detect different student profiles. Barber and Sharkey [1] use a predictive analytic model to prevent students from failing courses. They analyze several variables, such as grades, age and attendance, that can impede student learning. Khan et al. [7] present a long-term study using hierarchical cluster analysis, t-tests and Pearson correlation that identified seven behavior patterns of learners in online discussion forums based on their access. García-Solórzano et al. [6] introduce a new educational monitoring tool that helps tutors to monitor the development of the students. Unlike traditional monitoring systems, they propose a faceted browser visualization tool to facilitate the analysis of student progress. Glass [8] provides a versatile visualization tool to enable the creation of additional visualizations of data collections. Essa and Ayad [4] utilize predictive models to identify learners who are academically at risk. They present the problem with an interesting analogy to the patient-doctor workflow: first they identify the problem, then they analyze the situation, and finally they prescribe courses that are indicated to help the student succeed. Siadaty et al. [13] present the Learn-B environment, a hub system that captures information about users' usage of different software and learning activities in their workplace, and presents feedback to the user to support future decisions and planning and to accompany them in the learning process. In the same way, McAuley et al. [9] propose a visual analytics approach to support organizational learning in online communities. They present their analysis through an adjacency matrix and an adjustable timeline that show the communication actions of the users and organize them into temporal patterns. Bramucci and Gaston [2] present Sherpa, an academic recommendation system that supports students in making decisions. For instance, using the learner profiles, it recommends courses or triggers interventions when students are at risk. Taken together, these works show different perspectives on the problem and the need for new tools and methods to make data available and help decision-makers.

6. CONCLUSION

In this paper we presented the main features of the Cite4Me Web application. Cite4Me makes use of several data sources to provide information for users interested in scientific publications and their applications. Additionally, we provided a general framework for data discovery and for correlating resources based on a constructed feature set consisting of items extracted from reference datasets. This makes it possible for users to search a dataset and relate its resources to other resources offered as Linked Data. For more information about the Cite4Me Web application, refer to http://www.cite4me.com.

7. REFERENCES

[1] R. Barber and M. Sharkey. Course correction: using analytics to predict course success. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 259-262, New York, NY, USA, 2012. ACM.

[2] R. Bramucci and J. Gaston. Sherpa: increasing student success with a recommendation engine. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 82-83, New York, NY, USA, 2012. ACM.

[3] G. Cobo, D. García-Solórzano, J. A. Morán, E. Santamaría, C. Monzo, and J. Melenchón. Using agglomerative hierarchical clustering to model learner participation profiles in online discussion forums. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 248-251, New York, NY, USA, 2012. ACM.

[4] A. Essa and H. Ayad. Student success system: risk analytics and data visualization using ensembles of predictive models. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 158-161, New York, NY, USA, 2012. ACM.

[5] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606-1611, San Francisco, CA, USA, 2007. Morgan Kaufmann Pub. Inc.

[6] D. García-Solórzano, G. Cobo, E. Santamaría, J. A. Morán, C. Monzo, and J. Melenchón. Educational monitoring tool based on faceted browsing and data portraits. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 170-178, New York, NY, USA, 2012. ACM.

[7] T. M. Khan, F. Clear, and S. S. Sajadi. The relationship between educational performance and online access routines: analysis of students' access to an online discussion forum. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 226-229, New York, NY, USA, 2012. ACM.

[8] D. Leony, A. Pardo, L. de la Fuente Valentín, D. S. de Castro, and C. D. Kloos. Glass: a learning analytics visualization tool. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 162-163, New York, NY, USA, 2012. ACM.

[9] J. McAuley, A. O'Connor, and D. Lewis. Exploring reflection in online communities. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 102-110, New York, NY, USA, 2012. ACM.

[10] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proc. of the 7th International Conference on Semantic Systems, I-Semantics '11, pages 1-8, New York, NY, USA, 2011. ACM.

[11] B. Pereira Nunes, S. Dietze, M. A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl. Combining a co-occurrence-based and a semantic measure for entity linking. In ESWC, 2013 (to appear).

[12] B. Pereira Nunes, R. Kawase, S. Dietze, D. Taibi, M. A. Casanova, and W. Nejdl. Can entities be friends? In G. Rizzo, P. Mendes, E. Charton, S. Hellmann, and A. Kalyanpur, editors, Proc. of the Web of Linked Entities Workshop in conjunction with the 11th International Semantic Web Conference, volume 906 of CEUR-WS.org, pages 45-57, Nov. 2012.

[13] M. Siadaty, D. Gašević, J. Jovanović, N. Milikić, Z. Jeremić, L. Ali, A. Giljanović, and M. Hatala. Learn-B: a social analytics-enabled tool for self-regulated workplace learning. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 115-119, New York, NY, USA, 2012. ACM.

[14] C. van Rijsbergen, S. Robertson, and M. Porter. New models in probabilistic information retrieval. 1980.

Visualizing the LAK/EDM Literature Using Combined Concept and Rhetorical Sentence Extraction
Davide Taibi¹, Ágnes Sándor², Duygu Simsek³, Simon Buckingham Shum³, Anna De Liddo³, Rebecca Ferguson³

¹ Institute for Educational Technologies, Italian National Research Council, Via Ugo La Malfa 153, 90146 Palermo, Italy
davide.taibi@itd.cnr.it

² Parsing & Semantics Group, Xerox Research Centre Europe, 6 Chemin de Maupertuis, F-38240 Meylan, France
agnes.sandor@xrce.xerox.com

³ The Open University, Knowledge Media Institute & Institute of Educational Technology, Milton Keynes, MK7 6AA, UK
firstname.lastname@open.ac.uk

ABSTRACT
Scientific communication demands more than the mere listing of empirical findings or assertion of beliefs. Arguments must be constructed to motivate problems, expose weaknesses, justify higher-order concepts, and support claims to be advancing the field. Researchers learn to signal clearly in their writing when they are making such moves, and the progress of natural language processing technology has made it possible to combine conventional concept extraction with rhetorical analysis that detects these moves. To demonstrate the potential of this technology, this short paper documents preliminary analyses of the dataset published by the Society for Learning Analytics, comprising the full texts from primary conferences and journals in Learning Analytics and Knowledge (LAK) and Educational Data Mining (EDM). We document the steps taken to analyse the papers thematically using Edge Betweenness Clustering, combined with sentence extraction using the Xerox Incremental Parser's rhetorical analysis, which detects the linguistic forms used by authors to signal argumentative discourse moves. Initial results indicate that the refined subset derived from more complex concept extraction and rhetorically significant sentences, yields additional relevant clusters. Finally, we illustrate how the results of this analysis can be rendered as a visual analytics dashboard.

Network analysis yields sets of related papers based on statistical corpus processing (Section 3). In order to improve the precision of information about the content of the connections among the papers, we carried out semantic and rhetorical analysis (Section 4). On the one hand, we extracted similar concepts in order to provide topical similarity indicators (Section 4.1) and, on the other hand, we extracted salient sentences that indicate the main research topics of these papers (Section 4.2). We repeated the statistical analysis of this reduced list of concepts, and of the reduced list of salient sentences. At the end of this paper, we present the design and implementation of the first prototype of an analytics dashboard (Section 5), which is designed to summarize results of the socio-semantic-rhetorical analysis in a way that users will find both meaningful and easy to explore.

2. THE LAK DATASET


We selected the LAK Dataset¹ published by the Society for Learning Analytics Research (SoLAR²), which provides machine-readable plain-text versions of the Learning Analytics and Knowledge (LAK) conference proceedings and a journal special issue related to learning analytics, and of the Educational Data Mining (EDM) conferences and journal. The corpus was extracted using the SPARQL endpoint of the LAK dataset. The corpus comprised the following:
- 24 papers presented at the LAK2011 conference
- 42 papers presented at the LAK2012 conference
- 10 papers from the Journal of Educational Technology and Society special issue on learning analytics
- 31 papers presented at the EDM2008 conference
- 32 papers presented at the EDM2009 conference
- 64 papers presented at the EDM2010 conference
- 61 papers presented at the EDM2011 conference
- 52 papers presented at the EDM2012 conference

Categories and Subject Descriptors


K.3.1 [Computers and Education]: Computer Uses in Education

General Terms
Design

Keywords
Learning Analytics, Corpus Analysis, Scientific Rhetoric, Visualization, Network Analysis, Natural Language Processing

1. INTRODUCTION AND MOTIVATION


Our overall aims are to provide users automatically with suggestions about similar papers and about connections between papers, and to present these similarities and connections in ways that are both meaningful and searchable. In order to achieve this, we integrated three different approaches to linking and analysing a specific dataset of scientific papers (see Section 2). These approaches were:
1. network analysis
2. rhetorical analysis
3. visualization of the results

For each resource, the title, description and keywords properties were used to feed the data mining processes employed in our analysis. At the end of this initial process, a relational database was used to store 305 papers, 599 authors and 448 distinct keywords. After this preliminary phase, the entire LAK Dataset, a total of 305 papers, was analyzed using the Xerox Incremental Parser (XIP) [1] for concept extraction and rhetorical analysis; XIP extracted 7,847 sentences and 40,163 concepts.

¹ LAK Dataset: http://www.solaresearch.org/resources/lak-dataset. Published by SoLAR and made available to the LAK Data Challenge of the 3rd International Conference on Learning Analytics and Knowledge (http://lakconference.org).

² http://www.solaresearch.org

3. STATISTICAL ANALYSIS
A preliminary analysis reported the most-used keywords, the most frequently occurring authors and the most-referenced papers. A second phase of analysis was then carried out using the data-mining tool, RapidMiner [2].

and their aggregations [4]. The yEd tool allows users to balance the quality and speed of the clustering algorithm by means of a slider. When the quality is set to the highest value, the Girvan and Newman algorithm is used in its normal form. At the opposite end, the lowest quality value produces the fastest running time; in this case yEd executes a local betweenness calculation following Gregory's algorithm [5]. When a mid value is chosen for quality and speed, the fast betweenness approximation of Brandes and Pich [6] is applied. In this case, less accurate clustering is balanced by a lower execution time.

The clusters created with yEd have the following properties:
- each node (paper) is a member of exactly one cluster
- each node shares many edges with other members of its cluster, where an edge connects a pair of papers if their similarity value exceeds a threshold (0.3 in our experiment)
- each node shares few or no edges with nodes of other clusters

Figure 1 shows a visualization of the primary clusters. Some of the clusters did seem to have thematic coherence, while others were harder to label:
- Cluster 1: collaborative, learning, social
- Cluster 2: skills, model, slip, guess, parameters
- Cluster 3: causality, variables, model, construct
- Cluster 4: question, fit, grain, school, skill
- Cluster 5: translating, sentences, grinder, corpus
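The Edge Betweenness Clustering idea discussed in this section can be illustrated with a small sketch. This is not yEd's implementation: it is a minimal pure-Python version of one Girvan and Newman splitting step, assuming an unweighted, undirected graph without self-loops that is stored as a dictionary of neighbour sets. The function names (`edge_betweenness`, `components`, `girvan_newman_split`) are hypothetical and chosen only for this illustration.

```python
from collections import deque


def edge_betweenness(graph):
    """Edge betweenness for an unweighted, undirected graph (Brandes-style).

    Each undirected shortest path is counted from both endpoints, which
    doubles every score but does not change the ranking of the edges.
    """
    score = {frozenset((u, v)): 0.0 for u in graph for v in graph[u]}
    for source in graph:
        dist = {source: 0}
        sigma = {n: 0.0 for n in graph}   # number of shortest paths from source
        sigma[source] = 1.0
        preds = {n: [] for n in graph}    # predecessors on shortest paths
        order = []
        queue = deque([source])
        while queue:                      # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in graph}
        for w in reversed(order):         # accumulate dependencies backwards
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score


def components(graph):
    """Return the connected components of the graph as a list of node sets."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(graph[n] - comp)
        seen |= comp
        result.append(comp)
    return result


def girvan_newman_split(graph):
    """Remove highest-betweenness edges until the component count grows."""
    g = {n: set(nbrs) for n, nbrs in graph.items()}  # work on a copy
    initial = len(components(g))
    while len(components(g)) == initial:
        scores = edge_betweenness(g)
        if not scores:                    # no edges left to remove
            break
        u, v = max(scores, key=scores.get)
        g[u].discard(v)
        g[v].discard(u)
    return components(g)


# Two triangles joined by a single bridge edge (c, d).
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
clusters = girvan_newman_split(toy)
```

In the toy graph the bridge edge carries every shortest path between the two triangles, so it is removed first and the two triangles become separate clusters. Applying the split repeatedly to the resulting components would yield a full hierarchy; where to stop is a design choice that this sketch leaves open.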

3.1 Statistical Data from RapidMiner


A three-step process was developed in order to analyze the corpus using the data-mining tool, RapidMiner:
- Process documents from file: This module generates word vectors from the text files.
- Select attributes: This allows users to select the attributes to be considered by the analysis. In our case, a threshold was set in order to eliminate less important elements in the word vectors.
- Data to similarity: This module was used to calculate a similarity index for the conference papers based on cosine similarity.
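To make the last step concrete, the following is a minimal sketch of cosine similarity between two word vectors. It is not RapidMiner's implementation; it assumes that each paper is represented by raw term frequencies (the weighting actually used in the experiment is not stated here), and the helper names `word_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def word_vector(tokens):
    """Build a term-frequency vector (assumed weighting) from a token list."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


paper_1 = word_vector(["skill", "model", "student", "model"])
paper_2 = word_vector(["model", "student", "forum"])
similarity = cosine_similarity(paper_1, paper_2)
```

The value lies between 0 (no shared terms) and 1 (identical term distributions) for non-negative vectors, which is what allows a single threshold to be applied to all pairs of papers.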

The first block, "Process Documents from file", is made up of the following steps:
Tokenize: this operator splits the text of a document into a sequence of tokens.
Replace token: this operator is used to replace tokens, for instance in cases where words are misspelled.
Filter tokens (by length): this operator filters tokens based on their length. In our case, all the words with fewer than three characters were removed.
Filter stopwords (English): this operator filters English stopwords from a document by removing every token that is the same as a stopword from the built-in stopword list.
Stem (Snowball): this operator stems words using the Snowball tool.3
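The steps above can be sketched in a few lines of Python. This is only an illustration of the kind of processing described, not the RapidMiner operators themselves: the stopword list is a tiny hypothetical sample, `crude_stem` is a rough stand-in for the Snowball stemmer, and the "Replace token" step is omitted.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real built-in English list is much larger.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "from"}

def crude_stem(token: str) -> str:
    """Very rough suffix stripper, used here only as a stand-in for Snowball."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def word_vector(text: str, min_length: int = 3) -> Counter:
    """Turn raw text into a bag of stemmed words with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenize
    tokens = [t for t in tokens if len(t) >= min_length]    # filter by length
    tokens = [t for t in tokens if t not in STOPWORDS]      # filter stopwords
    return Counter(crude_stem(t) for t in tokens)           # stem and count

vector = word_vector("The students are learning in the models of learning")
```

Here `vector` counts the stem "learn" twice and "student" and "model" once each; the short words "in" and "of" and the stopwords are dropped before stemming.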

At the end of the main process, the Data to Similarity step returns two results:
a) The list of the most relevant words (stemmed version) used in the entire corpus.
b) The measured similarity index between the papers that make up the corpus.

Figure 1: Results of initial LAK paper clustering analysis

The complete list of the papers belonging to the clusters is reported on the web page5 associated with this work. This analysis was word-driven and not concept-driven. The next step was to try to refine it by distilling (1) a richer set of concepts, and (2) a more salient subset of sentences.

We used the similarity relationships between papers to build a network of papers. In this network each node represents a paper, and an edge is created between two papers if their similarity value exceeds a threshold of 0.3.
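A minimal sketch of this construction, assuming each paper is represented by a dictionary of word weights: cosine similarity is computed for every pair of papers, and an edge is kept when the value is above 0.3. The paper identifiers and word vectors below are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine of the angle between two sparse word vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_network(vectors: dict, threshold: float = 0.3) -> set:
    """Return the set of paper pairs whose similarity exceeds the threshold."""
    edges = set()
    for p, q in combinations(sorted(vectors), 2):
        if cosine_similarity(vectors[p], vectors[q]) > threshold:
            edges.add((p, q))
    return edges

# Hypothetical toy word vectors for three papers.
papers = {
    "paper_a": {"learn": 2, "student": 1},
    "paper_b": {"learn": 1, "student": 1, "model": 1},
    "paper_c": {"corpus": 3, "sentence": 1},
}
edges = build_network(papers)
```

Only `paper_a` and `paper_b` share vocabulary, so the resulting network has a single edge between them and `paper_c` stays isolated.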

3.2 Analysing the Network of Papers


The network of papers was then analysed with the yEd tool4 in order to extract clusters of documents using the algorithm for natural clusters based on Edge Betweenness Clustering proposed by Girvan and Newman [3]. This algorithm has been successfully used in Network Analysis to study communities

4. SEMANTIC ANALYSIS
In order to go beyond full-text statistical analysis and find connections between papers at the level of the claims they make, we processed the corpus using the Xerox Incremental Parser

3 http://snowball.tartarus.org
4 http://www.yworks.com
5 http://www.pa.itd.cnr.it/lak-data-challenge.html

(XIP) [1] for extracting concepts and rhetorically salient sentences [7].

4.1 Concept Extraction


The basic module of XIP performs morphosyntactic analysis, part-of-speech tagging, constituent analysis and dependency extraction on free text. Since we define concepts as simple or compound noun phrases, they can be identified using general morphosyntactic analysis. Examples of extracted concepts are "analytics", "learning analytics", "social learning analytics" and "social network analytics".
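To make the notion of "simple or compound noun phrase" concrete, the sketch below collects maximal runs of adjectives and nouns that end in a noun from tokens that are already part-of-speech tagged. It does not reproduce XIP's grammar; the tag names (`ADJ`, `NOUN`, `VERB`) and the example sentence are assumptions for illustration.

```python
def noun_phrases(tagged):
    """Collect maximal runs of adjectives/nouns that end in a noun."""
    phrases, run = [], []
    for word, tag in tagged + [("", "END")]:   # sentinel flushes the last run
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # Trim trailing adjectives so that the phrase ends with a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases

# Hypothetical pre-tagged sentence.
tagged = [
    ("social", "ADJ"), ("learning", "NOUN"), ("analytics", "NOUN"),
    ("supports", "VERB"),
    ("reflective", "ADJ"), ("teachers", "NOUN"),
]
concepts = noun_phrases(tagged)
```

With this input, `concepts` contains "social learning analytics" and "reflective teachers".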

4.2 Rhetorical Analysis


Scientific research does not consist of providing a list of facts, but of constructing narrative and argumentation around facts. In articles, researchers make hypotheses, support, refute, reconsider, confirm, and build on previous ideas in order to support their ideas and findings. The aim of rhetorical analysis is to detect where authors signal that they are making such moves. This analysis builds on the widely studied feature of research articles that, besides their well-defined standard structure (title, abstract, keywords, often an IMRAD body structure), rhetorical moves emphasize the articles' contribution to the state of the art and the research problems they address. In previous work [7] we described a list of rhetorical moves that characterize such salient messages, together with the extraction methodology. Figure 2 lists the detected rhetorical moves (in caps) together with examples of expressions that mark them.

A basic observation concerns the distribution of the pairs of similar papers yielded by the three methods. As expected, the largest number of similarity pairs was obtained when only the full text was taken into account, in both the LAK and the EDM collections. There are considerable overlaps among the three methods, and there are cases in which just one method yields similarity pairs. In subsequent evaluations we aim to assess these various cases. As a first step towards a more complete evaluation, we selected some pairs of papers and checked their similarity against independent similarity indicators. We found that our statistical method is coherent with the independent similarity indicators when similarity scores are high, and that in these cases similarity is found both with and without XIP-extracted text. This indicates the validity of our statistical method for finding related papers in these cases. Where no independent similarity indicator could be found, but we do have XIP-based similarity pairs, we looked for related key claims or findings in the pairs of papers7. When the similarity score between the two papers was high, we did find such interesting related claims in the two papers. However, when the similarity score was low, we did not find any related claims. This indicates that we might want to define a threshold score. The details of the preliminary tests are reported on the web page.

5. XIP DASHBOARD
The XIP Dashboard was designed to provide visual analytics from XIP output in order to help readers assess the current state of the art in terms of trends, patterns, gaps and connections in the LAK and EDM literature. The dashboard also draws attention to candidate patterns of potential significance within the dataset:
the occurrence of domain concepts in different metadiscourse contexts (e.g. "effective tutoring dialogue" in sentences classified as contrast);
trends over time (e.g. the development of an idea);
trends within and differences between research communities, as reflected in their publications.

5.1 Implementation
Figure 2: Rhetorical moves (in capital red letters) followed by some examples of expressions used to signify them in papers

Once the XIP concept extraction and rhetorical analysis were concluded, we repeated the cluster analysis on the XIP-filtered lists of concepts and salient sentences. Thus our statistical analysis of the LAK dataset (described in Section 3) has been conducted in three different ways:
considering the full text of the articles;
considering only the salient sentences extracted by XIP;
considering only the concepts extracted by XIP.

All the papers in the LAK dataset were analyzed using XIP. The output files of the XIP analysis, one per paper, were then imported into a MySQL database, and the user interface was implemented using PHP and JavaScript, making use of Google Chart Tools for the interactive visualizations.8

5.2 User Interface


The dashboard consists of three sections, each showing different analytical results in different types of chart. Section one of the dashboard shows two line charts, representing the LAK and the EDM conferences respectively. Each line chart shows the distribution of the number of salient sentences over time and by rhetorical marker type (see Figure 2 for a list of the types of rhetorical markers). Each coloured line in these line charts indicates how many sentences of a specific rhetorical type were extracted, and how this number changed by year (Figure 3 shows the line chart for the EDM conference).
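The aggregation behind such a line chart can be sketched as a count of sentences per year and rhetorical type. The dashboard itself is built on MySQL, PHP and JavaScript; the Python below only illustrates the counting step, and the records and type labels are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical records: (conference, year, rhetorical type of one sentence).
sentences = [
    ("EDM", 2011, "CONTRAST"),
    ("EDM", 2011, "NOVELTY"),
    ("EDM", 2012, "CONTRAST"),
    ("EDM", 2012, "CONTRAST"),
    ("LAK", 2012, "NOVELTY"),
]

def series_for(conference, records):
    """Return one list of yearly counts per rhetorical type, years ascending."""
    counts = Counter(
        (year, kind) for conf, year, kind in records if conf == conference
    )
    kinds = sorted({kind for _, kind in counts})
    years = sorted({year for year, _ in counts})
    return {kind: [counts[(year, kind)] for year in years] for kind in kinds}

edm_series = series_for("EDM", sentences)
```

Each key of `edm_series` corresponds to one coloured line in the chart, and each list gives that line's value per year.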
The comparison of the sets of papers yielded by the three approaches is still ongoing. At this stage we can only present some preliminary observations concerning pairs of similar papers yielded by the three kinds of input. The data obtained through this preliminary evaluation are reported on the web page6 related to this work.

6 http://www.pa.itd.cnr.it/lak-data-challenge.html
7 The related claims were found by reading the pairs of sentences. Our long-term goal is to provide the related claims automatically.
8 https://developers.google.com/chart

6. SUMMARY
This short paper has summarised an approach to conducting analytics on Learning Analytics. The LAK Dataset, comprising LAK and EDM literature, has been analyzed in order to identify clusters of papers dealing with similar topics (conceptual clustering), and in order to identify key contributions of papers in terms of the claims authors make, as signalled by rhetorical patterns. Our preliminary tests are promising, but more thorough testing is needed to validate the method. Finally, we showed how the results of this analysis are beginning to be visualized using an analytics dashboard. All the secondary datasets produced have been published as open data, for further research.

Figure 3: Rhetorical sentences graphed by year, for EDM

The second section of the dashboard (Figure 4) allows users to select a combination of the extracted concepts, in order to visualize the occurrence of these concepts in papers within any or all research communities represented in the corpus, that is to say across the whole LAK dataset (EDM plus LAK conference).

Figure 6: Distribution of rhetorical types in XIP-classified sentences within a selected concept bubble

In the longer term, the aim of this research is to provide users with automatic suggestions about similar papers and about connections between papers, and to present these similarities and connections in ways that are both meaningful and searchable for the users. Future steps will validate the outputs from these analyses with researchers, and test the usability of the dashboard with different end-users (e.g. researchers, educators, students).

Figure 4: Number of papers with rhetorically extracted sentences containing user-selected concepts

The third dashboard section consists of a bubble chart that displays the occurrence of papers within the entire dataset, filtered by user-selected concepts (Figure 5). This visualization can be restricted to display just the LAK or the EDM conference. In Figure 5, each bubble represents a concept that has been selected by the user. This is associated with a specific number of papers and sentences in which that concept has been detected. The colour saturation of each bubble (expressed by the colour spectrum shown at the top) represents the density of the chosen concept, defined as the number of XIP-extracted sentences in which the concept occurs. The darker the colour, the greater the density.

Figure 5: Concept density within XIP sentences, by year and number of papers

When a concept bubble is selected (Figure 6), a pie chart pops up representing the relative distribution of the rhetorical types for that bubble (that is to say for that concept, and across the papers and sentences in which the concept has been detected).

7. REFERENCES
[1] Aït-Mokhtar, S., Chanod, J.-P. and Roux, C. (2002). Robustness beyond shallowness: incremental dependency parsing. Natural Language Engineering, 8(2/3): 121-144.
[2] Jungermann, F. (2009). Information extraction with RapidMiner. In Proceedings of the GSCL Symposium "Sprachtechnologie und eHumanities". W. Hoeppner, ed.
[3] Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12): 7821-7826.
[4] Newman, M. E. J. (2004). Detecting community structure in networks. Eur. Phys. J. B, 38: 321-330.
[5] Gregory, S. (2008). Local betweenness for finding communities in networks. Technical Report, University of Bristol.
[6] Brandes, U. and Pich, C. Centrality estimation in large networks. Intl. Journal of Bifurcation and Chaos in Applied Sciences and Engineering, 17(7): 2303-2318.
[7] Sándor, Á. (2007). Modeling metadiscourse conveying the author's rhetorical strategy in biomedical research abstracts. Revue Française de Linguistique Appliquée, 200(2): 97-109.

Ontology Learning to Analyze Research Trends in Learning Analytics Publications


Amal Zouaq
Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON, Canada
+1 613 541 6000, Ext. 6478
amal.zouaq@rmc.ca

Srećko Joksimović
School of Interactive Arts and Technologies, Simon Fraser University, Surrey, BC, Canada
+1 778 782 7474
sjoksimo@sfu.ca

Dragan Gašević
School of Computing and Information Systems, Athabasca University, Athabasca, AB, Canada
+1 604 569 8515
dgasevic@acm.org

ABSTRACT
In this paper, we show how ontology learning tools can be used to reveal (i) the central research topics that are tackled in the published literature on learning analytics and educational data mining; (ii) relationships between these research topics; and (iii) (dis)similarities between learning analytics and educational data mining.

papers presented at the LAK conference editions and another one for the papers presented at the EDM conference editions in order to compare the two conferences based on concepts and relationships gauged as most important. We also performed analysis based on (a) paper abstracts only and (b) main body of text of the papers. In this short report, we first describe the data analysis pipeline. This is followed by a very brief discussion of a small fragment of the results we obtained in our analysis. The complete results in the CSV format are available at [8].

Categories and Subject Descriptors


I.2.7 [Artificial Intelligence]: Natural Language Processing; G.2.2 [Discrete Mathematics]: Graph Theory

General Terms
Algorithms, Measurement, Experimentation

2. DATA ANALYSIS PIPELINE


The data analysis relies on our ontology learning tool, OntoCmaps [10]. Ontology learning from text is a multi-layer knowledge extraction task that targets the following components:
Terms and concepts: The first step consists in identifying candidate expressions in texts. These expressions are then ranked using some measure (statistical metrics, graph-based metrics, etc.) to extract those that are relevant for the domain. The filtered, relevant expressions are then considered concepts in the ontology learning community.
Taxonomy: This step identifies is-a links in texts, generally using patterns that indicate a taxonomical link, such as Hearst's patterns [12], or using the inner structure of multiword expressions. For example, a "carnivorous plant" can be considered a "plant" just by looking at the syntactic structure (adjective + noun) of the expression.
Conceptual relationships: This step uses various techniques (patterns, machine learning, etc.) to identify any kind of transversal relations, with a domain and range.
Axioms: Finally, axioms here mean defined classes or rules extracted from texts.

OntoCmaps requires a domain corpus as input. As such, the LAK and EDM proceedings (the LAK dataset [13]) were an appropriate set of texts to test the ontology learning process. OntoCmaps relies on three main phases to learn a domain ontology: 1) the extraction phase, which performs a deep semantic analysis based on dependency patterns; 2) the integration phase, which builds concept maps composed of terms and labeled relationships and uses basic disambiguation techniques (these concept maps form a graph); and finally 3) the filtering phase, in which various metrics rank the items (terms and relationships) in the concept maps.
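The taxonomy step's use of the inner structure of multiword expressions can be sketched as follows. The sketch assumes that the head noun of an English multiword term is its last word; the function name and the example terms are illustrative and are not taken from OntoCmaps.

```python
def head_noun_links(terms):
    """Derive is-a candidates from the modifier + head structure of a term."""
    links = []
    for term in terms:
        words = term.split()
        if len(words) > 1:
            # Assumption: the last word is the head noun of the expression.
            links.append((term, "is-a", words[-1]))
    return links

links = head_noun_links(
    ["carnivorous plant", "plant", "intelligent tutoring system"]
)
```

The single-word term "plant" yields no link, while "carnivorous plant" is linked to "plant" and "intelligent tutoring system" to "system".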

Keywords
Ontology learning, deep parsing, filtering, information retrieval, ranking algorithms, graph theoretic statistics

1. INTRODUCTION
Learning analytics is a new research discipline. Although it has attracted a considerable amount of attention in educational research and practice, debate is still very active about the scope of the discipline. The definition of learning analytics offered by the Society for Learning Analytics Research [7], which is commonly used in the literature to date, gives a general framework for the main tasks that learning analytics is about. However, given the youth of the discipline, there are generally three open questions:
What are the central research topics that are tackled in the published literature?
What are the relationships between the central research topics?
What are the similarities and differences between learning analytics and educational data mining?

To address the above questions, we aimed to systematically analyze the textual content available in the LAK Challenge dataset. In particular, we used a state-of-the-art ontology learning tool, OntoCmaps, which enabled the automatic (i) parsing of textual content, (ii) creation of conceptual maps based on the extracted concepts and relationships, and (iii) filtering/ranking of the most important concepts and relationships based on measures from information retrieval, graph theory, and voting theory. The extraction of concepts and their filtering/ranking were done (i) for each edition of the two conferences and the journal special issue (from the LAK 2013 Challenge dataset) individually (i.e., LAK 2011-2012, EDM 2008-2013, and the LAK ET&S special issue) to see the emerging trends through the years; and (ii) by creating two subsets, one for the

2.1 The Extraction Phase


In the extraction phase, OntoCmaps relies on a hierarchy of syntactic patterns. Each pattern describes a set of syntactic relationships that permit the extraction of a semantic representation. OntoCmaps does not rely on any predefined domain knowledge. It uses two NLP tools to obtain the syntactic representations: the Stanford Parser along with its dependency module [2] and the Stanford part-of-speech (POS) Tagger [6]. Given a sentence, the Stanford Parser generates syntactic dependency relations between each pair of related words. The POS Tagger identifies the words' parts of speech. Based on these two inputs, OntoCmaps creates a syntactic pattern format that enriches the words in each dependency relation with their parts of speech. This enriched representation is then used as input to a pattern recognition task. A recognized pattern fires a rule that applies various transformations to the syntactic representation to obtain a semantic representation, in the form of expressions, triples or sets of triples. The patterns are divided into conceptual patterns and hierarchical patterns. Hierarchical patterns concentrate on the extraction of taxonomical links, following the work of [11], but based on the dependency formalism. Conceptual patterns identify the main structures of the language that can be transformed into triples useful for the extraction of conceptual relations. They are organized into a hierarchy from the most detailed patterns (containing the largest number of dependency relationships) to the least detailed. The extraction phase targets deeper levels of the hierarchy first to avoid extracting overly abstract or incomplete representations. For instance, if the pattern nsubj-dobj-xcomp exists in the text, the extractor should fire it instead of firing one of its higher-level counterparts, nsubj-dobj and nsubj-xcomp, which contain only a subset of the syntactic relationships of interest. If a pattern is instantiated, then all its parents in the hierarchy are disregarded.
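The "most detailed pattern wins" rule can be sketched by treating each pattern as a set of dependency labels. This is a simplification made for illustration: actual patterns match dependency structures anchored on specific words, and the `PATTERNS` table below contains only the three patterns named in the text.

```python
# Each pattern is modelled as the set of dependency relations it requires.
PATTERNS = {
    "nsubj-dobj-xcomp": frozenset({"nsubj", "dobj", "xcomp"}),
    "nsubj-dobj": frozenset({"nsubj", "dobj"}),
    "nsubj-xcomp": frozenset({"nsubj", "xcomp"}),
}

def patterns_to_fire(relations):
    """Return the matched patterns that are not parents of another match."""
    found = set(relations)
    matched = {name: rels for name, rels in PATTERNS.items() if rels <= found}
    # A matched pattern is disregarded when a more detailed match contains it.
    return sorted(
        name
        for name, rels in matched.items()
        if not any(rels < other for other in matched.values())
    )

detailed = patterns_to_fire(["nsubj", "dobj", "xcomp"])
simple = patterns_to_fire(["nsubj", "dobj"])
```

When all three relations are present, only nsubj-dobj-xcomp fires; when xcomp is absent, nsubj-dobj fires on its own.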

include:
The Degree centrality of a node, which is the number of edges from and to that node;
The Betweenness centrality, which assigns each node a value derived from the number of shortest paths that pass through it;
The HITS algorithm, which ranks nodes according to the importance of hubs and authorities [5]. This resulted in two measures, Hits-Hubs and Hits-Authority;
The PageRank of a node [1].
We also computed standard information retrieval metrics, mainly term frequency (TF) and TF-IDF.
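Two of these measures are simple enough to sketch directly. The code below is not the JUNG implementation: it counts degree over a list of concept-relationship-concept triples and computes one common TF-IDF variant (raw term frequency times the natural logarithm of N/df), which is an assumption since the exact variant is not stated here. The triples and documents are hypothetical.

```python
import math
from collections import Counter

def degree(triples):
    """Count the edges from and to each node of a concept-map graph."""
    counts = Counter()
    for source, _, target in triples:
        counts[source] += 1
        counts[target] += 1
    return counts

def tf_idf(documents):
    """Return one {term: tf-idf weight} dictionary per tokenized document."""
    doc_freq = Counter(term for doc in documents for term in set(doc))
    n = len(documents)
    return [
        {term: tf * math.log(n / doc_freq[term])
         for term, tf in Counter(doc).items()}
        for doc in documents
    ]

# Hypothetical toy data.
triples = [
    ("student", "use", "tool"),
    ("tool", "identify", "student"),
    ("teacher", "guide", "student"),
]
docs = [["student", "model"], ["student", "skill", "skill"]]

degree_scores = degree(triples)
weights = tf_idf(docs)
```

In this toy example "student" has the highest degree, while its TF-IDF weight is zero because it occurs in every document.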

Finally, using the graph-based metrics, we defined a number of voting schemes (VS) with the aim of improving the precision of filtering. All the voting schemes relied on three metrics that were identified as being among the best in previous experiments [10][11]: Degree, Betweenness and HITS-Hubs. The voting schemes include:
Majority voting scheme: this scheme recognizes a term as important if it is chosen by at least k metrics out of n, with k > n/2.
Borda Count voting scheme: this method assigns a rank to each candidate. A candidate who is ranked first receives n points (n = number of domain terms to be ranked), the second n-1, the third n-2 and so on. The score of a term over all metrics is equal to the sum of the points obtained by the term in each metric.
Nauru voting scheme: this scheme is based on the sum of the inverted ranks of each term in each metric. It is used to put more emphasis on higher ranks.
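The three schemes can be sketched over a table of metric scores. The sketch makes two assumptions that are not stated above: a metric "chooses" a term when the term's score reaches a given threshold, and ties in rank are broken alphabetically. The scores are hypothetical.

```python
def ranks(scores):
    """Map each term to its 1-based rank under one metric (best score first)."""
    ordered = sorted(scores, key=lambda term: (-scores[term], term))
    return {term: position + 1 for position, term in enumerate(ordered)}

def majority(metrics, threshold):
    """Keep the terms chosen by more than half of the metrics."""
    n = len(metrics)
    votes = {}
    for scores in metrics.values():
        for term, value in scores.items():
            if value >= threshold:  # assumed meaning of "chosen by a metric"
                votes[term] = votes.get(term, 0) + 1
    return sorted(term for term, k in votes.items() if k > n / 2)

def borda(metrics):
    """First place earns n points, second n-1, and so on; sum over metrics."""
    totals = {}
    for scores in metrics.values():
        n = len(scores)
        for term, rank in ranks(scores).items():
            totals[term] = totals.get(term, 0) + (n - rank + 1)
    return totals

def nauru(metrics):
    """Sum of inverted ranks (1/rank) of each term over all metrics."""
    totals = {}
    for scores in metrics.values():
        for term, rank in ranks(scores).items():
            totals[term] = totals.get(term, 0.0) + 1 / rank
    return totals

# Hypothetical scores for three terms under the three metrics.
metrics = {
    "degree": {"student": 0.9, "datum": 0.6, "paper": 0.2},
    "betweenness": {"student": 0.8, "datum": 0.3, "paper": 0.5},
    "hits_hubs": {"student": 0.7, "datum": 0.6, "paper": 0.1},
}
important = majority(metrics, threshold=0.5)
borda_scores = borda(metrics)
nauru_scores = nauru(metrics)
```

With these scores, "student" and "datum" pass the majority vote, and both Borda and Nauru rank "student" first, with Nauru rewarding its three first places more strongly.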

2.2 The Integration Phase


In the integration phase, all the extracted relationships are gathered into concept maps. Some basic term disambiguation tasks are performed at this level, mainly: i) lemmatization, which considers singular, plural and other forms of the same terms or relationships as referring to a single concept or relationship; ii) basic synonym detection based on abbreviation relations that are generated by the Stanford parser; and iii) a kind of co-reference resolution that is built into some of the patterns and that allows for the creation of semantic links between terms in a sentence, even if no direct dependency links existed in the original dependency representation. For example, in the sentence "carnivorous plants are organisms which eat insects", the co-reference resolution creates a relation "eat" between the term "carnivorous plants" and the term "insects", while the grammatical representation links the term "plants" to the term "insects". All these operations result in concept maps around various terms. For example, if there were a number of statements around the term "carnivorous plants" in texts, it is likely that a concept map around "carnivorous plants" will be created. This process is repeated for all identified terms and relationships and results in an aggregation of concept maps through links between various concept maps, thus constituting a graph, with terms as nodes and relationships as edges.
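A minimal sketch of this aggregation, assuming the extraction phase has already produced triples: terms and relationship labels are normalized with a lemma table and merged into one graph. The `LEMMAS` table is a tiny hypothetical stand-in for a real lemmatizer, and synonym detection and co-reference resolution are not modelled.

```python
# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {
    "plants": "plant",
    "insects": "insect",
    "organisms": "organism",
    "eats": "eat",
}

def lemma(phrase):
    """Lowercase a phrase and replace each word by its lemma when known."""
    return " ".join(LEMMAS.get(word, word) for word in phrase.lower().split())

def integrate(triples):
    """Merge triples into a graph: term -> set of (relationship, term)."""
    graph = {}
    for source, relation, target in triples:
        s, r, t = lemma(source), lemma(relation), lemma(target)
        graph.setdefault(s, set()).add((r, t))
        graph.setdefault(t, set())  # make sure the target is also a node
    return graph

graph = integrate([
    ("carnivorous plants", "eat", "insects"),
    ("Carnivorous plant", "eats", "insect"),
    ("carnivorous plants", "are", "organisms"),
])
```

The first two triples collapse into a single edge after lemmatization, so the concept map around "carnivorous plant" ends up with two outgoing relationships.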

Table 1 shows the top ranked concepts based on the majority voting scheme. All the base metrics (Betweenness, PageRank, Degree, etc.) and voting schemes have been computed and can be found at [8]. The Web site [8] also features a visualization of the extracted data based on the obtained concept maps. The visualization is performed per venue (EDM/LAK/ETS-SI), per corpus (only abstracts or main texts) and per year (2008-2012).

2.3.2 Relationship Filtering


Similarly, a number of metrics were used to identify important relationships. The first measure treats all the relationships that occur between important terms (determined through the voting schemes) as important relationships. This constitutes our voting schemes for relationships, which were based on the results of the majority voting scheme for concepts. The second measure ranks relationships based on Edge Betweenness centrality, which is a measure of the importance of edges based on the number of shortest paths that contain them. The third measure assigns co-occurrence frequency weights based on the Dice coefficient [9], a standard measure of semantic relatedness. Table 2 shows an excerpt of the top ranked relationships based on the majority voting scheme. In contrast to standard named entity extractors, an important aspect of ontology learning is the ability to extract relationships as well, thus obtaining not only topics but also relationships (taxonomical and conceptual) between these topics. A better approach would combine topic extraction using named entity extractors, linked data semantic annotators and ontology learning.
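The first and third measures can be sketched as follows. The Dice coefficient is written in its usual frequency form, 2 * f(x, y) / (f(x) + f(y)); how the counts are gathered from the corpus is not specified here, so the numbers and triples below are hypothetical.

```python
def dice(pair_count, count_a, count_b):
    """Dice coefficient from a co-occurrence count and two term counts."""
    if count_a + count_b == 0:
        return 0.0
    return 2 * pair_count / (count_a + count_b)

def important_relationships(triples, important_terms):
    """Keep the relationships whose two end points are both important terms."""
    return [
        triple for triple in triples
        if triple[0] in important_terms and triple[2] in important_terms
    ]

# Hypothetical example: two terms co-occur 3 times, and occur 4 and 6 times.
weight = dice(3, 4, 6)

kept = important_relationships(
    [("learner", "build", "knowledge"), ("word", "uttered by", "student")],
    important_terms={"learner", "knowledge", "student"},
)
```

Here `weight` is 0.6, and only the first triple is kept because "word" is not among the important terms.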

2.3 The Filtering Phase


The third and last phase for learning the domain ontology is the filtering phase, which aims at ranking the items in concept maps (domain terms, taxonomical links, and conceptual links).

2.3.1 Concept Filtering


A number of metrics from graph theory and from information retrieval are used to identify relevant terms. Graph-based metrics were computed using the JUNG framework [3]. These metrics

Table 1. Top ranked concepts based on the majority voting scheme, extracted from the subsets of the LAK 2013 Challenge dataset

LAK (abstracts) | LAK (paper body) | EDM (abstracts) | EDM (paper body)
student (0.50) | student (0.75) | student (0.75) | student (0.75)
datum (0.45) | datum (0.20) | model (0.38) | model (0.23)
informal_learn (0.31) | learner (0.15) | datum (0.37) | datum (0.19)
learn (0.31) | course (0.15) | method (0.19) | skill (0.09)
teacher (0.29) | analysis (0.12) | paper (0.16) | problem (0.08)
model (0.27) | activity (0.11) | system (0.13) | result (0.06)
learning_analytics (0.26) | user (0.10) | result (0.12) | method (0.06)
learner (0.25) | tool (0.10) | approach (0.11) | parameter (0.05)
social_factor (0.21) | learn (0.09) | skill (0.08) | question (0.05)
social_learn (0.19) | analytics (0.07) | analysis (0.07) | performance (0.05)
effective_learn (0.19) | group (0.07) | intelligent_tutoring_system (0.07) | system (0.05)
group_learn (0.17) | system (0.07) | behavior (0.07) | approach (0.04)
knowledge_professional (0.17) | teacher (0.06) | tool (0.07) | example (0.04)
Lak (0.17) | instructor (0.06) | work (0.06) | feature (0.04)
knowledge (0.17) | network (0.06) | Researcher (0.06) | item (0.04)

Table 2. Top ranked relationships based on the majority voting scheme, extracted from the subsets of the LAK 2013 Challenge dataset. Each entry is a concept - relationship - concept triplet.

LAK (abstracts)
learner - build - knowledge (1)
datum - obtained from - learner (0.81)
learning_analytics - important step for - teachers_of_tomorrow (0.78)
teachers_of_tomorrow - is a - teacher (0.77)
tool - incorporate functionality to access - datum (0.65)
model - can be used to inform - student (0.64)
datum - obtained from - instructor (0.62)
learner - generating - datum (0.58)
student - accessing - online_discussion_forum (0.56)
model - can be used to inform - teacher (0.51)
student - flock to - online_service (0.48)
datum - are combined to calculate - likelihood_of_student (0.45)
instructor - guide - student (0.39)
learn - integral to - success_of_community (0.37)
likelihood_of_student - is related to - student (0.36)

LAK (paper body)
course - being recorded as well as to - student (1)
datum - break ability to educate effectively - student (0.60)
system - addresses individually - student (0.45)
analysis - have since been moved as - student (0.37)
network - impacting - student (0.31)
process - finally should promote reflection on - instructor (0.29)
tool - identify - student (0.27)
datum - may be presented to - learner (0.25)
activity - conducted by - user (0.25)
group - will contain - student (0.25)
environment - capture - datum (0.24)
model - highly accurate on - student (0.22)
average - miss - student (0.21)
role - are imposed on - student (0.21)
information - useful for - student (0.20)

EDM (abstracts)
datum - mining - method (1)
method - linguistics in - paper (0.95)
model - are trained over - datum (0.70)
system - provides - student (0.61)
student - are represented by - model (0.56)
model - can detect - student (0.50)
datum - derived from - student (0.43)
goal - has been investigated by - researcher (0.42)
tutoring_system - is a - system (0.40)
student - study with - intelligent_tutoring_system (0.39)
skill - studied in - tutoring_system (0.38)
intelligent_tutoring_system - are informed by - datum (0.32)
analysis - reveals - unexpected_result (0.30)
unexpected_result - is a - result (0.30)
collaborative - learning - interactions_of_student (0.29)

EDM (paper body)
model - fit - student (1)
datum - are collected far from - student (0.96)
skill - will have been covered by - student (0.67)
problem - assign for - student (0.67)
example - parameterization by - student (0.63)
question - were based - student (0.62)
student - provides useful evidence to - model (0.60)
step - requires - student (0.57)
performance - dependent upon - student (0.56)
accuracy - varies across - student (0.48)
student - is guessing - result (0.48)
student - collect - datum (0.45)
word - uttered by - student (0.44)
datum - were used to build - model (0.44)
skill - are included in - model (0.41)

We can also notice that we were not always successful in extracting meaningful relationship labels from this corpus. One possible explanation is the type of texts (publications) and the amount of noise in these texts. In fact, OntoCmaps is designed to run on clean, plain sentences that describe and define a domain of interest. Parts of research papers such as figure captions, formulas, and references represent noise for OntoCmaps. Additional cleaning of the input texts would be necessary. However, even when the labels were not meaningful, the existence of a link between two concepts (an unlabeled relationship) shed some light on the domain (see Section 3).

cases, such as learning_analytics, the lemmatizer returned the expression itself). First, we could not possibly include all the results of all the metrics we calculated in our experiment (those results are available at [8]). Second, we selected the metrics that proved most accurate in our previous research [10], [11]. Finally, it should be noted that the purpose of our experiment here was not to evaluate the effectiveness of individual metrics, but rather to explore whether ontology learning technology can shed some light on the questions posed in the introduction, which are of relevance to the LAK 2013 Data Challenge. The concepts reported in Table 1 reveal that papers of both the LAK and EDM conferences have students, data and models as shared concepts. However, it is clear that LAK papers also focus on teachers/instructors, informal learning, and social, networked, and group learning. On the other hand, EDM papers focus on (data mining) methods and approaches, intelligent tutoring systems, features (extraction), and various types of parameters.

3. FINDINGS
In this section, we present only the results for the 15 top-ranked concepts and relationships according to the Majority Voting Scheme (Betweenness, Degree, and Hits-Hub), as shown in Tables 1-2 (N.B. As can be noticed in the tables, the majority of the terms are lemmatized, that is, we show only their lemma or root, for example informal_learn for informal learning or datum for data. In a few

Figure 1. Two conceptual maps extracted from the abstracts of the papers presented at the LAK conference

Relationships reported in Table 2 further corroborate the observation that the LAK papers are more focused on teachers, in order to empower them with learning analytics and to help them guide students. Moreover, there is an emphasis on (promoting) reflection of both students and instructors. Various aspects of social learning, such as role playing and the impact of communities, appear to be highly popular topics in the LAK papers. On the other hand, EDM papers are much more focused on intelligent tutoring systems, accuracy of different types of (predictive) models, and revealing unexpected patterns. Certainly, a focus on data is shared by both the LAK and EDM communities, but LAK also seems to be focused on data collected by and for instructors, not only for students. This probably indicates a trend that the LAK community has so far acknowledged the role of instructors in the learning process and aimed at supporting them as much as learners. The EDM community has, however, focused more on measuring and predicting specific types of skills. This is consistent with their focus on intelligent tutoring systems, in which automated assessment of learners' skills is of paramount importance. Finally, we were also able to visualize the extracted conceptual graphs. In Figure 1, we show the relationships of the concept learning analytics as extracted from the abstracts of the papers presented at

the LAK conference. This figure further corroborates earlier observations by indicating that learning analytics is an integral part of the teaching profession, is an important step for teachers of tomorrow and learners, and offers a new approach. This figure also reveals the nature of learning analytics to promote qualitative understanding of context of information. Learning analytics is also (strongly) related to discourse analytics, which seems to be consistent with the strong emphasis of learning analytics on social learning and which is further confirmed by extracted relationships of discourse learning analytics with sense-making, argumentation and social, all of which are types of skills recognized as important for modern society.

Figure 2. Visualization of top 30 ranked concepts based on the majority voting scheme extracted from the abstracts of the LAK 2013 Challenge dataset.

More interesting results are available on our website [8]. For example, those results allow for (i) comparing results of different concept/relationship measures and (ii) observing chronological trends emerging throughout the years of individual editions of both conferences. An example of one of the visualizations available at [8] is presented in Figure 2. Of course, ontology learning tools are not perfectly accurate, and thus a few strange concepts and relationships are shown in our tables. There is, however, an opportunity in using such ontology learning tools as starting points for the development of a concept map of the learning analytics domain, which can then be refined through crowdsourcing (e.g., in a Wiki-like manner).

4. CONCLUSION
Funnily, our text analysis tool inferred that EDM is an abbreviation of learning analytics. This probably comes from the open debate reflected in the analyzed papers about the relationships between learning analytics and educational data mining. We hope that this paper sheds some light on the (dis)similarities of the two areas. We also hope that our analysis of the LAK 2013 Data Challenge dataset with the ontology learning tools indicated a high potential of this type of analytics to help the research community of a new research discipline define itself and its relationships with the closest communities. In future work, we plan to analyze further the research trends over the years for the LAK and EDM communities. Another of our goals is to compare the extractions of an ontology learning system such as OntoCmaps with Linked Data semantic annotators such as DBpedia Spotlight1 or Alchemy2.
¹ https://github.com/dbpedia-spotlight/dbpedia-spotlight/
² http://www.alchemyapi.com/

5. REFERENCES
[1] Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine, Stanford University.
[2] De Marneffe, M.-C., MacCartney, B. and Manning, C.D. (2006). Generating Typed Dependency Parses from Phrase Structure Parses. In Proc. of LREC, pp. 449-454, ELRA.
[3] JUNG (2013). Last retrieved from http://jung.sourceforge.net/
[4] Klein, D. and Manning, C.D. (2003). Accurate Unlexicalized Parsing. Proc. of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
[5] Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment, Journal of the ACM 46(5): 604-632, ACM.
[6] Toutanova, K., Klein, D., Manning, C.D. & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of HLT-NAACL, pp. 252-259.
[7] http://www.solaresearch.org/mission/about/
[8] http://lakchallenge.co.nf
[9] Van Rijsbergen, Cornelis Joost (1979). Information Retrieval. London: Butterworths. ISBN 3-642-12274-4.
[10] Zouaq, A., Gasevic, D. and Hatala, M. (2011). Towards Open Ontology Learning and Filtering, Information Systems, 36(7): 1064-1081.
[11] Zouaq, A., Gasevic, D. and Hatala, M. (2012a). Voting Theory for Concept Detection. The 9th Extended Semantic Web Conference 2012 (ESWC 2012), pp. 315-329.
[12] Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proc. 14th Conference on Computational Linguistics Vol. 2 (COLING '92), 539-545.
[13] Taibi, D., Dietze, S. Fostering analytics on learning analytics research: the LAK dataset, Technical Report, 03/2013, URL: http://resources.linkededucation.org/2013/03/lak-datasettaibi.pdf.
