This is a brief introduction to a new research subject that I recently started working on. It serves as a case study for very large EMF models and for applying big data techniques to EMF, which is my current research subject. I also covered this subject in this talk:
Problem: Is Software Engineering a Science?
Science is defined as a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe. But how testable are typical theses of software engineering:
- DSLs allow domain experts to develop software effectively and more efficiently than with GPLs.
- Static type systems lead to safer programming and fewer bugs.
- Functional programming leads to less performant programs.
- Scrum allows teams to develop software faster.
- My framework allows one to develop … more, faster … with less, fewer …
The reasons for the lack of quantitative empirical research in software engineering are manifold and include issues like data quality, scalability, and heterogeneity. To elaborate on these issues, we should first look at the fields within software engineering that explicitly cover the empirical analysis of software.
Related Fields: Mining Software Repositories (MSR) and Metrics
Software repositories (i.e. source code repositories) contain more than source code. Market basket analysis style reasoning, e.g. “programmers that changed code X also changed code Y”, can be used to extract implicit dependencies from revision histories. This information (which is otherwise opaque) is used by traditional MSR approaches to fix issues with individual repositories: (1) visualize implicit dependencies, (2) find or predict bugs [3,4], (3) identify architectural flaws, or (4) mine for API usage patterns.
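The co-change reasoning above can be made concrete with a small sketch. This is a minimal, hypothetical example (not srcrepo code): commits are simply represented as sets of changed file names, and the confidence of a rule “changing X implies changing Y” is the fraction of X-changing commits that also touched Y.

```python
from collections import defaultdict
from itertools import combinations

def cochange_confidence(commits):
    """Compute confidence for rules 'changing X implies changing Y'
    from a list of commits, each given as a set of changed files."""
    change_count = defaultdict(int)  # how often each file changed
    pair_count = defaultdict(int)    # how often two files changed together
    for files in commits:
        for f in files:
            change_count[f] += 1
        for a, b in combinations(sorted(files), 2):
            pair_count[(a, b)] += 1
            pair_count[(b, a)] += 1
    # confidence(X -> Y) = P(Y changed | X changed)
    return {(x, y): n / change_count[x] for (x, y), n in pair_count.items()}

# hypothetical revision history
commits = [
    {"Parser.java", "Lexer.java"},
    {"Parser.java", "Lexer.java", "Ast.java"},
    {"Parser.java"},
]
rules = cochange_confidence(commits)
# rules[("Lexer.java", "Parser.java")] is 1.0: every Lexer change touched Parser
```

Real MSR tools mine such rules at scale and with statistical safeguards (minimum support, etc.); the sketch only shows the underlying market basket idea.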
The MSR community lacks a common technology that allows it to apply all developed techniques uniformly. Instead, individual teams seem to build their own proprietary systems that are then only applicable to a specific MSR technique. Aside from apparent reasons like concrete repository software or dependencies on specific programming languages (issue 1: abstractions), this is mainly caused by the resource intensiveness of MSR. Therefore, only very specialized systems can provide the performance needed (issue 2: scalability).
Software metrics are used to measure certain properties of software (e.g. size, complexity) in order to assess costs (e.g. to maintain or develop software). Similar to MSR, metrics are language dependent (issue 1: abstractions), and calculating metrics over the evolution of software (or over many software projects) is computationally expensive (issue 2: scalability).
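To illustrate what “calculating metrics over the evolution of software” means, here is a minimal sketch with an assumed, very simple size metric (non-blank, non-comment lines) applied to each revision of one hypothetical file; the file contents and metric are illustrative, not from srcrepo.

```python
def loc(source: str) -> int:
    """A minimal size metric: count non-blank, non-comment lines."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("//"))

# hypothetical revisions of one file over its history
revisions = [
    "int f() {\n  return 1;\n}\n",
    "// refactored\nint f() {\n  int x = 1;\n  return x;\n}\n",
]
evolution = [loc(r) for r in revisions]  # the metric's value per revision
```

Even this toy example shows the cost structure: every metric must be recomputed (or incrementally updated) for every revision of every file, which is what makes evolution-wide analysis expensive.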
The presented issues make it hard to apply MSR to large-scale software repositories (repositories with 100-1000 projects, e.g. Apache, Eclipse). But I believe that, if these issues are overcome, MSR can be applied in a larger context, where many projects are analysed to learn something about software engineering itself. Traditional software metrics and their evolution over revision history, as well as new metrics that include implicit dependency information, can be used to empirically analyse (1) engineering methodologies, (2) programming languages, or (3) API design (patterns).
Approach: A Framework for the Analysis of Large-Scale Software Repositories
Programming APIs for source code repositories, reverse engineering software code into models (i.e. AST-level models of code), and frameworks for persisting large models allow us to examine a software repository as meta-model based data (e.g. an EMF model). Our tool srcrepo [6,7] already does this. It uses jGit, MoDisco, and emf-fragments to create AST-level models of the revision histories in git repositories of eclipse projects. This framework could be extended to other languages and source code repositories due to its (meta-)model-based nature. This abstraction can solve issue 1.
For a metrics based analysis of such source code models, we need techniques to effectively describe and execute aggregation queries. To navigate within the extracted data effectively, all queries need to be managed and all accumulated data has to be associated with its source. The (meta-)modeling community has a large variety of appropriate model transformation and query technologies in store.
Applying MSR to a large number of source code repositories requires a lot of computation time. The rationale is that model persistence techniques and query languages can be identified/developed that allow us to execute MSR on large computing clusters that are governed by modern cloud-computing frameworks (e.g. hadoop). emf-fragments already uses hadoop’s hbase to persist models in manageable chunks (fragments). It is reasonable to assume that we can tailor an OCL-like language to execute on these fragments in a map/reduce fashion. This would solve issue 2.
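The map/reduce idea can be sketched without any cluster: each fragment is processed independently by a map phase that emits (key, value) pairs, and a reduce phase aggregates all values per key. The fragment representation below is a deliberate simplification (type/size tuples instead of real model elements), not the emf-fragments data model.

```python
from collections import defaultdict

# Each fragment is a self-contained chunk of the model, simplified here
# to a list of (element_type, size) tuples.
fragments = [
    [("Class", 120), ("Method", 15), ("Method", 30)],
    [("Class", 80), ("Method", 10)],
]

def map_phase(fragment):
    # Emit (key, value) pairs per fragment; fragments can be
    # processed in parallel because they do not depend on each other.
    for element_type, size in fragment:
        yield element_type, size

def reduce_phase(pairs):
    # Aggregate all emitted values per key, as a reducer would.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [p for frag in fragments for p in map_phase(frag)]
totals = reduce_phase(pairs)  # total size per element type over all fragments
```

An OCL-like aggregation query (e.g. “sum of sizes of all methods”) could then be compiled into exactly such map and reduce functions over fragments.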
Our framework srcrepo already allows us to create EMF models from git repositories containing eclipse (Java) projects. The eclipse source repositories (git.eclipse.org) provide over 300 such repositories, containing software projects of varying sizes, including eclipse itself.
To verify the conceptual soundness of a “model-based MSR”, we can apply existing MSR algorithms and techniques. Candidates are:
- 1. Here, implicit dependencies are used to identify cross-cutting concerns in a software repository. Measurements on many repositories could be used to reason about the effectiveness of AOP or refactoring techniques.
- 2. Here, the evolution of modularity in large code bases is analysed using Design Structure Matrices (DSMs). The researchers try to estimate the impact of refactoring efforts on the cohesion of modules.
Interesting research tracks
Metrics for revision histories
We have metrics for software code and software models, and there are also fundamental metrics for software repositories. But there are no metrics that combine both. In particular, there are no metrics that involve the implicit dependencies hidden within source code repositories. Furthermore, with these dependencies, metrics become uncertain: they represent statistical processes and not exact numbers.
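What such an uncertain metric could look like can be sketched as follows. This is a hypothetical example: coupling between two files is estimated from their co-change frequency, and the estimate is reported together with its sample size, since it is a statistical quantity rather than an exact number.

```python
def coupling(x, y, commits):
    """Estimate the probability that a change to x also touches y,
    together with the number of observations the estimate rests on."""
    changed_x = [c for c in commits if x in c]
    if not changed_x:
        return 0.0, 0  # no observations, no meaningful estimate
    both = sum(1 for c in changed_x if y in c)
    return both / len(changed_x), len(changed_x)

# hypothetical revision history: each commit is a set of changed files
commits = [{"A", "B"}, {"A"}, {"A", "B"}, {"B"}]
p, n = coupling("A", "B", commits)  # point estimate and sample size
```

A metric like this only becomes meaningful with enough observations, which is another reason why such analyses call for many projects and long revision histories.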
Comparing languages and methodologies
Language evangelists have fought for decades over what the “best” language is and which development process is the most efficient. MSR allows us to model development efforts precisely and, more importantly, promises to find the sources of avoidable costs or to estimate the impact of certain tasks (e.g. refactoring). To correlate certain properties with the programming languages or methodologies used, we need a large base of different (open source) software projects, and the techniques used need to scale accordingly.
- Ahmed E. Hassan: The Road Ahead for Mining Software Repositories, 2008
- Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, Andreas Zeller: Mining Version Histories to Guide Software Changes, 2005
- Nachiappan Nagappan, Thomas Ball, Andreas Zeller: Mining Metrics to Predict Component Failures, 2006
- Sunghun Kim, E. James Whitehead, Jr., Yi Zhang: Classifying Software Changes: Clean or Buggy?, 2008
- C. C. Williams, J. K. Hollingsworth: Automatic Mining of Source Code Repositories to Improve Bug Finding Techniques
- Markus Scheidgen: Reference Representation Techniques for Large Models, BigMDE 2013
- Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns, 2006
- Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software Designs: An Empirical Study of Open Source and Proprietary Code, 2005
- Markus Scheidgen, Anatolij Zubow, Joachim Fischer, Thomas H. Kolbe: Automated and Transparent Model Fragmentation for Persisting Large Models; MODELS 2012, Wien.