March 23, 2017
by Markus

Creating and Analyzing Source Code Repository Models – A Model-based Approach to Mining Software Repositories

Abstract—With mining software repositories (MSR), we analyze the rich data created during the whole evolution of one or more software projects. One major obstacle in MSR is the heterogeneity and complexity of source code as a data source. With model-based technology in general and reverse engineering in particular, we can use abstraction to overcome this obstacle. But, this raises a new question: can we apply existing reverse engineering frameworks that were designed to create models from a single revision of a software system to analyze all revisions of such a system at once? This paper presents a framework that uses a combination of EMF, the reverse engineering framework Modisco, a NoSQL-based model persistence framework, and OCL-like expressions to create and analyze fully resolved AST-level model representations of whole source code repositories. We evaluated the feasibility of this approach with a series of experiments on the Eclipse code-base.

Keywords—EMF, MSR, Reverse Engineering, Large Models


Download Paper

Official Publication

SrcRepo at GitHub


@inproceedings{scheidgen2017srcrepo,
  author    = {Markus Scheidgen and Martin Schmidt and Joachim Fischer},
  title     = {Creating and Analyzing Source Code Repository Models - A Model-based Approach to Mining Software Repositories},
  booktitle = {Proceedings of the 5th International Conference on Model-Driven Engineering and Software Development - Volume 1: {MODELSWARD}},
  year      = {2017}
}

December 16, 2016
by Markus

Evaluation of Model Comparison for Delta-Compression in Model Persistence

Abstract—Model-based software engineering is applied to more and more complex software systems. As a result, larger and larger models with longer and longer histories have to be maintained and persisted. Already, a lot of research effort went into model versioning, comparison, and repositories. Existing strategies either record and persist changes (change-based repositories, e.g. EMF-Store) or rely on existing text-based version control systems to persist whole model revisions (state-based repositories). Both approaches have advantages and disadvantages. We suggest a hybrid approach that infers changes via comparison to persist delta-compressed model states. Our hypothesis is that delta-compression requires a trade-off between comparison quality and execution time. Existing model comparison frameworks are tailored for comparison quality and not necessarily execution time performance. Therefore, we evaluate and compare traditional line-based comparison, an existing model comparison framework (EMF-Compare), and our own framework (EMF-Compress). We reverse engineered the Eclipse code-base and its history with MoDisco to create a large corpus of evolving example models for our experiments.

Keywords—EMF, Persistence, Compression, Model Comparison
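The hybrid idea behind delta-compressed model states can be illustrated with a toy example: persist the first revision in full, then persist only the differences inferred by comparing consecutive revisions, and reconstruct any revision by applying deltas. The Java sketch below is hypothetical and deliberately naive; it matches entries by index, whereas EMF-Compare and EMF-Compress perform proper model-element matching.

```java
import java.util.*;

// Naive state-based delta: record per-index changes plus the new length.
// Real model comparison matches elements structurally, not positionally.
public class DeltaSketch {
    record Change(int index, String newValue) {}
    record Delta(List<Change> changes, int newSize) {}

    // "Comparison" phase: infer a delta between two revisions.
    static Delta diff(List<String> oldRev, List<String> newRev) {
        List<Change> changes = new ArrayList<>();
        for (int i = 0; i < newRev.size(); i++) {
            if (i >= oldRev.size() || !oldRev.get(i).equals(newRev.get(i))) {
                changes.add(new Change(i, newRev.get(i)));
            }
        }
        return new Delta(changes, newRev.size());
    }

    // Reconstruction phase: replay a delta onto the previous revision.
    static List<String> apply(List<String> oldRev, Delta delta) {
        List<String> result = new ArrayList<>(
            oldRev.subList(0, Math.min(oldRev.size(), delta.newSize())));
        while (result.size() < delta.newSize()) result.add(null);
        for (Change c : delta.changes()) result.set(c.index(), c.newValue());
        return result;
    }

    public static void main(String[] args) {
        List<String> r1 = List.of("class A {", "  int x;", "}");
        List<String> r2 = List.of("class A {", "  int x;", "  int y;", "}");
        Delta d = diff(r1, r2);
        System.out.println(d.changes().size());      // 2 (instead of 4 full lines)
        System.out.println(apply(r1, d).equals(r2)); // true
    }
}
```

A better matcher (e.g. longest common subsequence, or similarity heuristics as in EMF-Compare) yields smaller deltas at higher comparison cost, which is exactly the trade-off the paper evaluates.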


Download Paper
emf-compress at GitHub


@inproceedings{scheidgen2016compress,
  author    = {Markus Scheidgen},
  title     = {Evaluation of Model Comparison for Delta-Compression in Model Persistence},
  booktitle = {4th International Workshop on {BigMDE},
               Held as Part of {STAF} 2016, Vienna, Austria, July 6-7,
               2016, Proceedings},
  year      = {2016}
}

December 16, 2016
by Markus

Metamodeling vs Metaprogramming: A Case Study on Developing Client Libraries for REST APIs

Abstract—Web-services with REST APIs comprise the majority of the programmable web. To access these APIs more safely and conveniently, language specific client libraries can hide REST details behind regular programming language idioms. Manually building such libraries is straightforward, but tedious and error prone. Fortunately, model-based development provides different methods to automate their development. In this paper, we present our experiences with two opposing approaches to describe existing REST APIs and to generate type-safe client side Java libraries from these descriptions. First, we use an EMF-metamodel and a code generator (external DSL). Secondly, we use the Java compatible language Xtend and its metaprogramming mechanism active annotations, which allows us to alter the semantics of existing Xtend constructs to describe REST APIs within Xtend (internal DSL). Furthermore, we present related approaches and discuss our findings comparatively.

Keywords—EMF, Xtend, REST


Download Paper
xraw at GitHub


@inproceedings{scheidgen2016metamodeling,
  author    = {Markus Scheidgen and
               Sven Efftinge and
               Frederik Marticke},
  title     = {Metamodeling vs Metaprogramming: {A} Case Study on Developing Client
               Libraries for {REST} APIs},
  booktitle = {Modelling Foundations and Applications - 12th European Conference,
               {ECMFA} 2016, Held as Part of {STAF} 2016, Vienna, Austria, July 6-7,
               2016, Proceedings},
  pages     = {205--216},
  year      = {2016},
  crossref  = {DBLP:conf/ecmdafa/2016},
  doi       = {10.1007/978-3-319-42061-5_13}
}

@proceedings{DBLP:conf/ecmdafa/2016,
  editor    = {Andrzej Wasowski and
               Henrik L{\"{o}}nn},
  title     = {Modelling Foundations and Applications - 12th European Conference,
               {ECMFA} 2016, Held as Part of {STAF} 2016, Vienna, Austria, July 6-7,
               2016, Proceedings},
  series    = {Lecture Notes in Computer Science},
  volume    = {9764},
  publisher = {Springer},
  year      = {2016},
  doi       = {10.1007/978-3-319-42061-5},
  isbn      = {978-3-319-42060-8}
}

October 9, 2015
by Markus

XRaw – JSON and REST without Boilerplate

Check XRaw on GitHub for source code and an up-to-date version of this article.

If you, as a Java programmer, think that plain JSON is just not type-safe enough, or that Jackson and co. are just too heavy and stiff to be compatible with your Scala or Xtend coding style, you should read on.

XRaw provides a set of active annotations that simplify the development of type-safe Java wrappers for JSON data, RESTful API calls, and MongoDB interfaces. It provides helpful features for creating social-media-aware apps and backends with Java (and Xtend).

Active annotations are an Xtend feature that allows us to semantically enrich simple data object declarations with functionality that transparently (un)marshals Java to JSON data, encodes REST requests, or accesses a database.

JSON example

The following small Xtend file demonstrates the use of XRaw annotations to create wrapper types for some typical JSON data:

@JSON class Library {
  List<Book> books
  String adress
  @Name("count") int entries
}

@JSON class Book {
  String title
  String isbn
  List<String> authors
  @WithConverter(UtcDateConverter) Date publish_date
}

Based on this data description, we can now simply use the class Library to wrap corresponding JSON data into Java POJOs:

val library = new Library(new JSONObject('''{
  books: [
    {
      title: "Pride and Prejudice",
      authors: ["Jane Austin"],
      isbn: "96-2345-33123-32",
      publish_date: "1813-04-12T12:00:00Z"
    },
    {
      title: "SAP business workflow",
      authors: ["Ulrich Mende", "Andreas Berthold"]
    }
  ],
  adress: "Unter den Linden 6, 1099 Berlin, Germany",
  count: 2
}'''))

For example, we can use Xtend to find all “old” books:

val oldBooks = library.books.filter[it.publishDate.year < 1918]

Since Xtend compiles to Java, we can also use the wrapper types in Java programs:

public long countBooksBefore(Library library, int year) {
  return library.getBooks().stream().filter(book -> book.getPublishDate().getYear() < year).count();
}

REST example

This is a simple “script” based on a Twitter API wrapper created with XRaw.

// For starters, use XRawScript to interactively create a Twitter instance
// with a pre-configured HTTPService that deals with all OAuth related issues.
val twitter = XRawScript::get("data/store.json", "markus", Twitter)

// REST API endpoints are structured and accessed via fluent interface.
val userTimelineReq = twitter.statuses.userTimeline

// Parameters can also be added fluently (the parameter name here is an example).
userTimelineReq.count(100)

// xResult will execute the request and wrap the returned JSON data.
val userTimelineResult = userTimelineReq.xResult

// Use Xtend and its iterable extensions to navigate the results.
userTimelineResult.filter[it.retweetCount > 4].forEach[println(it.text)]

// Or as a "one liner".
twitter.statuses.userTimeline.xResult
    .filter[it.retweetCount > 4].forEach[println(it.text)]

This is written in Xtend. You could also use Scala, Java, or any other JVM/bytecode-based language.

Get started

git clone xraw
cd xraw/de.scheidgen.xraw/
mvn compile

Look at the examples.


XRaw is early in development. There is no release yet; XRaw is not available via Maven Central yet.


  • JSON
    • wrapper for existing JSON data or to create new JSON
    • support for primitive values, arrays, objects
    • converters to convert complex types to and from strings
    • different key names in JSON and Java to adapt to existing code
  • REST
    • wrapper for GET and POST requests
    • with URL and body parameters
    • with parameters encoded in URL path
    • with array and object JSON results
    • customizable HTTP implementation, e.g. to integrate with existing signing and OAuth solutions
    • customizable response types, e.g. to use API-specific data passed through HTTP headers, HTTP status codes, etc.
  • MongoDB
    • simple database wrapper for uni-typed collections of JSON data


I need you to try XRaw and check the existing API snippets (we have some for Twitter, Facebook, YouTube, Twitch, Tumblr). Tell us what works and what doesn’t, and what annotations you need.

August 13, 2015
by Markus

Generation of Random Software Models for Benchmarks

Abstract—Since model driven engineering (MDE) is applied to larger and more complex systems, the memory and execution time performance of model processing tools and frameworks has become important. Benchmarks are a valuable tool to evaluate performance and hence assess scalability. But, benchmarks rely on reasonably large models that are unbiased, can be shaped to distinct use-case scenarios, and are ”real” enough (e.g. non-uniform) to cause real-world behavior (especially when mechanisms that exploit repetitive patterns like caching, compression, JIT-compilation, etc. are involved). Creating large models is expensive and error-prone, and neither existing models nor uniform synthetic models cover all three of the wanted properties. In this paper, we use randomness to generate unbiased, non-uniform models. Furthermore, we use distributions and parametrization to shape these models to simulate different use-case scenarios. We present a meta-model-based framework that allows us to describe and create randomly generated models based on a meta-model and a description written in a specifically developed generator DSL. We use a random code generator for an object-oriented programming language as case study and compare our result to non-randomly and synthetically created code, as well as to existing Java-code.

Keywords—EMF, Benchmarks, Generation, Large models
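The core trick of the paper, shaping random structure with parameterized distributions, can be sketched in a few lines of Java. The generator below is a hypothetical stand-in for the RandomEMF DSL: it builds a random containment tree whose branching factor follows a geometric distribution, so most nodes have few children and a few have many, similar to size distributions in real code.

```java
import java.util.Random;

// Distribution-shaped random generation: structural decisions (number of
// children per node) are drawn from a parameterized distribution instead
// of being uniform. Hypothetical sketch, not the RandomEMF generator DSL.
public class RandomTree {
    static int size = 0;

    // Geometric distribution: P(k) = (1-p)^k * p, so small k dominates.
    static int geometric(Random rnd, double p) {
        int k = 0;
        while (rnd.nextDouble() > p) k++;
        return k;
    }

    static void generate(Random rnd, int depth) {
        size++; // count the generated model object
        if (depth == 0) return;
        int children = geometric(rnd, 0.4);
        for (int i = 0; i < children; i++) generate(rnd, depth - 1);
    }

    public static void main(String[] args) {
        // A fixed seed keeps generated benchmark models reproducible.
        generate(new Random(42), 6);
        System.out.println(size > 0); // true
    }
}
```

Changing p, or swapping in another distribution, is the parametrization knob that shapes models toward different use-case scenarios.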


Download Paper
RandomEMF at GitHub


@inproceedings{scheidgen2015random,
  author    = {Scheidgen, Markus},
  title     = {{Generation of Large Random Models for Benchmarking}},
  booktitle = {Proceedings of the 3rd Workshop on Scalable Model Driven Engineering},
  editor    = {Kolovos, Dimitris S. and Ruscio, Davide Di and Matragkas, Nicholas and Cuadrado, Jes\'{u}s S\'{a}nchez and Rath, Istvan and Tisi, Massimo},
  pages     = {1--10},
  publisher = {CEUR},
  year      = {2015}
}

September 30, 2014
by Markus

Model-Based Mining of Source Code Repositories

Abstract—The Mining Software Repositories (MSR) field analyzes the rich data available in source code repositories (SCR) to uncover interesting and actionable information about software system evolution. Major obstacles in MSR are the heterogeneity of software projects and the amount of data that is processed. Model-driven software engineering (MDSE) can deal with heterogeneity by abstraction as its core strength, but only recent efforts in adopting NoSQL-databases for persisting and processing very large models made MDSE a feasible approach for MSR. This paper is a work-in-progress report on srcrepo: a model-based MSR system. Srcrepo uses the NoSQL-based EMF-model persistence layer EMF-Fragments and Eclipse’s MoDisco reverse engineering framework to create EMF-models of whole SCRs that comprise all code of all revisions at an abstract syntax tree (AST) level. An OCL-like language is used as an accessible way to finally gather information such as software metrics from these SCR models.

Keywords—EMF, Mining Software Repositories, Metrics, OCL, Software Evolution


@incollection{scheidgen2014mining,
  author    = {Scheidgen, Markus and Fischer, Joachim},
  title     = {Model-Based Mining of Source Code Repositories},
  booktitle = {System Analysis and Modeling: Models and Reusability},
  series    = {Lecture Notes in Computer Science},
  editor    = {Amyot, Daniel and Fonseca i Casas, Pau and Mussbacher, Gunter},
  publisher = {Springer International Publishing},
  year      = {2014}
}

June 2, 2014
by Markus

Reference Representation Techniques for Large Models

Abstract—If models consist of more and more objects, the time and space required to process these models become an issue. To solve this, we can employ different existing frameworks that use different model representations (e.g. trees in XMI or relational data with CDO). Based on the observation that these frameworks reach different performance measures for different operations and different model characteristics, we raise the question if and how different model representations can be combined to mitigate performance issues of individual representations.

In this paper, we analyze different techniques to represent references, which are one important aspect of processing large models efficiently. We present the persistence framework EMF-Fragments, which combines the representation of references as source-object contained sets of target-objects (e.g. in XMI) with their representation as relations similar to those in relational databases (e.g. with CDO). We also present a performance evaluation for both representations and discuss the use of both representations in three applications: models for source-code repositories, scientific data, and geo-spatial data.

Keywords—EMF, persistence, databases
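The two reference representations compared in the paper can be contrasted with a minimal Java example: a set of targets stored with the source object (XMI-style), versus a global relation of (source, target) pairs (CDO/relational-style). The data is made up for illustration; EMF-Fragments combines both representations.

```java
import java.util.*;

// Same references (A -> B, A -> C) in two representations.
public class RefRepresentations {
    public static void main(String[] args) {
        // (1) Source-object contained set of targets (e.g. in XMI):
        // forward navigation from A is a single lookup.
        Map<String, Set<String>> embedded = Map.of("A", Set.of("B", "C"));
        System.out.println(embedded.get("A").contains("B")); // true

        // (2) Relation of (source, target) pairs (e.g. with CDO):
        // reverse queries ("who references B?") need only a scan of the
        // relation, not of every source object's target set.
        List<String[]> relation = List.of(
            new String[]{"A", "B"}, new String[]{"A", "C"});
        long refsToB = relation.stream().filter(p -> p[1].equals("B")).count();
        System.out.println(refsToB); // 1
    }
}
```

Which representation performs better depends on the operation mix, which is why a combination per reference can pay off.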


@inproceedings{Scheidgen2013RRT,
  author    = {Scheidgen, Markus},
  title     = {Reference Representation Techniques for Large Models},
  booktitle = {Proceedings of the Workshop on Scalability in Model Driven Engineering},
  series    = {BigMDE '13},
  year      = {2013},
  isbn      = {978-1-4503-2165-5},
  location  = {Budapest, Hungary},
  pages     = {5:1--5:9},
  articleno = {5},
  numpages  = {9},
  doi       = {10.1145/2487766.2487769},
  acmid     = {2487769},
  publisher = {ACM},
  address   = {New York, NY, USA},
  keywords  = {EMF, big data, meta-modeling, mining software repositories, model persistence}
}

June 24, 2013
by Markus

SrcRepo: A model-based framework for analyzing large scale software repositories

This is a brief introduction to a new research subject that I recently started working on. It serves as a case study for very large EMF models and for applying big data techniques to EMF, my current research focus. I also covered this subject in a talk.

Problem: Is Software Engineering a Science?

Science is defined as a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe. But how testable are typical theses of software engineering:

    • DSLs allow domain experts to develop software more effectively and efficiently than with GPLs.
    • Static type systems lead to safer programming and fewer bugs.
    • Functional programming leads to less performant programs.
    • Scrum allows teams to develop programs faster.
    • My framework allows to develop … more, faster … with less, fewer

The reasons for the lack of quantitative empirical research in software engineering are manifold and include issues like data quality, scalability, and heterogeneity. To elaborate on these issues, we should first look at the fields in software engineering that explicitly cover the empirical analysis of software.

Related Fields: Mining Software Repositories (MSR) and Metrics

Software repositories (i.e. source code repositories) contain more than source code. Market basket analysis style reasoning, e.g. “programmers that changed code X also changed code Y“, can be used to extract implicit dependencies from revision histories. This information (that is otherwise opaque) is used by traditional MSR [1] approaches to fix issues with individual repositories: (1) visualize implicit dependencies [2], (2) find or predict bugs [3,4], (3) identify architectural flaws, (4) or mine for API usage patterns [5].
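This market-basket style reasoning can be sketched directly: treat each commit as a "basket" of changed files and count how often pairs of files change together. The commit sets below are made up; a real MSR tool would read them from the version control log.

```java
import java.util.*;

// Market-basket style co-change mining: count how often two files were
// changed in the same commit. High counts hint at implicit dependencies.
// Hypothetical sketch; real tools extract these sets from the VCS history.
public class CoChange {
    static Map<String, Integer> coChangeCounts(List<Set<String>> commits) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> commit : commits) {
            List<String> files = new ArrayList<>(commit);
            Collections.sort(files); // canonical pair order
            for (int i = 0; i < files.size(); i++)
                for (int j = i + 1; j < files.size(); j++)
                    counts.merge(files.get(i) + " <-> " + files.get(j), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Set<String>> commits = List.of(
            Set.of("X.java", "Y.java"),
            Set.of("X.java", "Y.java", "Z.java"),
            Set.of("Z.java"));
        // X and Y changed together twice: a candidate implicit dependency.
        System.out.println(coChangeCounts(commits).get("X.java <-> Y.java")); // 2
    }
}
```

On real repositories the counts would additionally be normalized (e.g. as confidence or lift) before being interpreted as dependencies.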

The MSR community lacks a common technology that allows it to apply all developed techniques uniformly [1]. Instead, individual teams seem to build their own proprietary systems that are then only applicable to a specific MSR technique. Aside from apparent reasons like concrete repository software or dependencies on specific programming languages (issue 1: abstractions), this is mainly due to the resource intensiveness of MSR. Therefore, only very specialized systems can provide the performance needed (issue 2: scalability).

Software metrics are used to measure certain properties of software (e.g. size, complexity) to assess costs (e.g. to maintain or develop software). Similar to MSR, metrics are language-dependent (issue 1: abstractions), and calculating metrics over the evolution of software (or many software projects) is computationally expensive (issue 2: scalability).

The presented issues make it hard to apply MSR to large scale software repositories (repositories with 100-1000 projects, e.g. Apache, Eclipse). But, I believe that (if these issues are overcome) MSR can be applied in a larger context, where many projects are analysed to learn something about software engineering itself. Traditional software metrics and their evolution over revision history as well as new metrics that include implicit dependency information can be used to empirically analyse (1) engineering methodologies, (2) programming languages, or (3) API design (patterns).

Approach: A Framework for the Analysis of Large-Scale Software Repositories

Programming APIs for source code repositories, reverse engineering software code into models (i.e. AST-level models of code), and frameworks for persisting large models allow us to examine a software repository as meta-model based data (e.g. an EMF model). Our tool srcrepo [6,7] already does this. It uses jGit, MoDisco, and emf-fragments [10] to create AST-level models of the revision histories in Git repositories of Eclipse projects. Due to its (meta-)model-based nature, this framework could be extended to other languages and source code repositories. This abstraction can solve issue 1.

For a metrics based analysis of such source code models, we need techniques to effectively describe and execute aggregation queries. To navigate within the extracted data effectively, all queries need to be managed and all accumulated data has to be associated with its source. The (meta-)modeling community has a large variety of appropriate model transformation and query technologies in store.

Applying MSR to a large number of source code repositories requires a lot of computation time. The rationale is that model persistence techniques and query languages can be identified or developed that allow us to execute MSR on large computing clusters governed by modern cloud-computing frameworks (e.g. Hadoop). emf-fragments [7] already uses Hadoop’s HBase to persist models in manageable chunks (fragments). It is reasonable that we can tailor an OCL-like language to execute on these fragments in a map/reduce fashion. This would solve issue 2.
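The map/reduce execution idea can be sketched with plain Java streams standing in for a cluster job: each fragment is mapped to a partial metric independently, then the partial results are reduced into one repository-wide number. The fragment contents below are simplified stand-ins for AST-level model objects.

```java
import java.util.*;

// Sketch of a map/reduce-style metric over model fragments: the map phase
// computes a partial count per fragment (independently parallelizable,
// e.g. as a Hadoop job over HBase-stored fragments), the reduce phase sums
// the partial results. Hypothetical illustration, not the srcrepo engine.
public class FragmentMetric {
    record Fragment(List<String> objectTypes) {}

    static long countMethods(List<Fragment> fragments) {
        return fragments.parallelStream()                       // map phase
            .mapToLong(f -> f.objectTypes().stream()
                .filter("MethodDeclaration"::equals).count())
            .sum();                                             // reduce phase
    }

    public static void main(String[] args) {
        List<Fragment> repo = List.of(
            new Fragment(List.of("ClassDeclaration", "MethodDeclaration")),
            new Fragment(List.of("MethodDeclaration", "MethodDeclaration")));
        System.out.println(countMethods(repo)); // 3
    }
}
```

The key property is that no fragment needs to see any other fragment, which is what makes the per-fragment persistence layout amenable to cluster execution.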

First Case-Studies

Our framework srcrepo already allows us to create EMF models from Git repositories containing Eclipse (Java) projects. The Eclipse source repositories provide over 300 such repositories, containing software projects of varying sizes, including Eclipse itself.

To verify the conceptual soundness of a “model-based MSR”, we can apply existing MSR algorithms and techniques. Candidates are:

  1. [8] Here, implicit dependencies are used to identify cross-cutting concerns in a software repository. Measurements on many repositories could be used to reason about the effectiveness of AOP or refactoring techniques.
  2. [9] Here, the evolution of modularity in large code bases is analysed using Design Structure Matrices (DSM). The researchers try to estimate the impact of refactoring efforts on the cohesion of modules.

Interesting research tracks

Metrics for revision histories

We have metrics for software code and software models, and there are also fundamental metrics for software repositories. But there are no metrics that combine both. In particular, there are no metrics that involve the implicit dependencies hidden within source code repositories. Furthermore, with these dependencies, metrics become uncertain and represent statistical processes rather than exact numbers.

Comparing languages and methodologies

Language evangelists have fought for decades over which language is the “best” and which development process is the most efficient. MSR allows us to model development efforts precisely and, more importantly, promises to find the sources of avoidable costs or to estimate the impact of certain tasks (e.g. refactoring). To correlate certain properties with the programming languages or methodologies used, we need a large base of different (open source) software projects, and the techniques used need to scale accordingly.


  1. Ahmed E. Hassan: The Road Ahead for Mining Software Repositories, 2008
  2. Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, Andreas Zeller: Mining Version Histories to Guide Software Changes, 2005
  3. Nachiappan Nagappan, Thomas Ball, Andreas Zeller: Mining Metrics to Predict Component Failures, 2006
  4. Sunghun Kim, E. James Whitehead, Jr., Yi Zhang : Classifying Software Changes: Clean or Buggy? 2008
  5. CC Williams, JK Hollingsworth: Automatic mining of source code repositories to improve bug finding techniques
  6. Markus Scheidgen: Reference Representation Techniques for Large Models; BigMDE 2013
  8. Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns, 2006
  9. Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software Designs: An Empirical Study of Open Source and Proprietary Code, 2005
  10. Markus Scheidgen, Anatolij Zubow, Joachim Fischer, Thomas H. Kolbe: Automated and Transparent Model Fragmentation for Persisting Large Models; MODELS 2012, Vienna.

June 24, 2013
by Markus

Refactorings in Language Development with Asymmetric Bidirectional Model Transformations

Abstract—Software language descriptions comprise several heterogeneous interdependent artifacts that cover different aspects of languages (abstract syntax, notation and semantics). The dependencies between those artifacts demand the simultaneous adaptation of all artifacts when the language is changed. Changes to a language that do not change semantics are referred to as refactorings. This class of changes can be handled automatically by applying predefined types of refactorings. Refactorings are therefore considered a valuable tool for evolving a language.

We present a model transformation based approach for the refactoring of software language descriptions. We use asymmetric bidirectional model transformations to synchronize the various artifacts of language descriptions with a refactoring model that contains all elements that are changed in a particular refactoring. This allows for automatic, type-safe refactorings that also include the language tooling. We apply this approach to an Ecore-, Xtext-, and Xtend-based language description and describe the implementation of a non-trivial refactoring.

Keywords—DSL evolution, language description, refactoring, bidirectional model transformations


@inproceedings{schmidt2013refactorings,
  author    = {Martin Schmidt and
               Arif Wider and
               Markus Scheidgen and
               Joachim Fischer and
               Sebastian von Klinski},
  title     = {Refactorings in Language Development with Asymmetric Bidirectional
               Model Transformations},
  booktitle = {SDL Forum},
  year      = {2013},
  pages     = {222--238},
  crossref  = {DBLP:conf/sdl/2013}
}

@proceedings{DBLP:conf/sdl/2013,
  editor    = {Ferhat Khendek and
               Maria Toeroe and
               Abdelouahed Gherbi and
               Rick Reed},
  title     = {SDL 2013: Model-Driven Dependability Engineering - 16th
               International SDL Forum, Montreal, Canada, June 26-28, 2013,
               Proceedings},
  series    = {Lecture Notes in Computer Science},
  volume    = {7916},
  publisher = {Springer},
  year      = {2013},
  isbn      = {978-3-642-38910-8}
}