September 30, 2014
by Markus
0 comments

Model-Based Mining of Source Code Repositories

Abstract—The Mining Software Repositories (MSR) field analyzes the rich data available in source code repositories (SCR) to uncover interesting and actionable information about software system evolution. Major obstacles in MSR are the heterogeneity of software projects and the amount of data that is processed. Model-driven software engineering (MDSE) can deal with heterogeneity through abstraction, its core strength, but only recent efforts in adopting NoSQL databases for persisting and processing very large models have made MDSE a feasible approach for MSR. This paper is a work-in-progress report on srcrepo: a model-based MSR system. Srcrepo uses the NoSQL-based EMF-model persistence layer EMF-Fragments and Eclipse’s MoDisco reverse engineering framework to create EMF models of whole SCRs that comprise all code of all revisions at an abstract syntax tree (AST) level. An OCL-like language is used as an accessible way to finally gather information such as software metrics from these SCR models.

Keywords—EMF, Mining Software Repositories, Metrics, OCL, Software Evolution

Download

@incollection{Scheidgen2014MSR,
 year={2014},
 isbn={978-3-319-11742-3},
 booktitle={System Analysis and Modeling: Models and Reusability},
 volume={8769},
 series={Lecture Notes in Computer Science},
 editor={Amyot, Daniel and Fonseca i Casas, Pau and Mussbacher, Gunter},
 doi={10.1007/978-3-319-11743-0_17},
 title={Model-Based Mining of Source Code Repositories},
 url={http://dx.doi.org/10.1007/978-3-319-11743-0_17},
 publisher={Springer International Publishing},
 author={Scheidgen, Markus and Fischer, Joachim},
 pages={239-254},
 language={English}
}

June 2, 2014
by Markus
0 comments

Reference Representation Techniques for Large Models

Abstract—If models consist of more and more objects, the time and space required to process these models become an issue. To solve this, we can employ different existing frameworks that use different model representations (e.g. trees in XMI or relational data with CDO). Based on the observation that these frameworks reach different performance measures for different operations and different model characteristics, we raise the question if and how different model representations can be combined to mitigate the performance issues of individual representations.

In this paper, we analyze different techniques to represent references, which are one important aspect of processing large models efficiently. We present the persistence framework EMF-Fragments, which combines the representation of references as sets of target objects contained in the source object (e.g. in XMI) with their representation as relations similar to those in relational databases (e.g. with CDO). We also present a performance evaluation of both representations and discuss their use in three applications: models for source-code repositories, scientific data, and geo-spatial data.

Keywords—EMF, persistence, databases

Download

@inproceedings{Scheidgen:2013:RRT:2487766.2487769,
 author = {Scheidgen, Markus},
 title = {Reference Representation Techniques for Large Models},
 booktitle = {Proceedings of the Workshop on Scalability in Model Driven Engineering},
 series = {BigMDE '13},
 year = {2013},
 isbn = {978-1-4503-2165-5},
 location = {Budapest, Hungary},
 pages = {5:1--5:9},
 articleno = {5},
 numpages = {9},
 url = {http://doi.acm.org/10.1145/2487766.2487769},
 doi = {10.1145/2487766.2487769},
 acmid = {2487769},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {EMF, big data, meta-modeling, mining software repositories, model persistence},
}

June 24, 2013
by Markus
0 comments

SrcRepo: A model-based framework for analyzing large scale software repositories

This is a brief introduction to a new research subject that I recently started working on. It serves as a case study for very large EMF models and for applying big data techniques to EMF, which is my current research focus. I also covered this subject in a recent talk.

Problem: Is Software Engineering a Science?

Science is defined as a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe. But how testable are typical theses of software engineering:

    • DSLs allow domain experts to develop software more effectively and efficiently than with GPLs.
    • Static type systems lead to safer programming and fewer bugs.
    • Functional programming leads to less performant programs.
    • Scrum allows teams to develop programs faster.
    • My framework allows one to develop … more, faster … with less, fewer

The reasons for the lack of quantitative empirical research in software engineering are manifold and include issues like data quality, scalability, and heterogeneity. To elaborate on these issues, we should first look at the fields in software engineering that explicitly cover the empirical analysis of software.

Related Fields: Mining Software Repositories (MSR) and Metrics

Software repositories (i.e. source code repositories) contain more than source code. Market basket analysis style reasoning, e.g. “programmers that changed code X also changed code Y“, can be used to extract implicit dependencies from revision histories. This otherwise opaque information is used by traditional MSR approaches [1] on individual repositories to (1) visualize implicit dependencies [2], (2) find or predict bugs [3,4], (3) identify architectural flaws, or (4) mine for API usage patterns [5].
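To make the market-basket analogy concrete, here is a minimal sketch in Java (illustrative only, with made-up file names; this is not srcrepo code) that counts how often pairs of files change in the same commit. File pairs with high counts indicate implicit dependencies:

import java.util.*;

public class CoChangeCounter {

    // Count, for every pair of files, how often both appear in the same commit.
    static Map<String, Integer> coChanges(List<Set<String>> commits) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (Set<String> commit : commits) {
            List<String> files = new ArrayList<>(commit);
            Collections.sort(files); // canonical order, one key per pair
            for (int i = 0; i < files.size(); i++) {
                for (int j = i + 1; j < files.size(); j++) {
                    pairCounts.merge(files.get(i) + " & " + files.get(j), 1, Integer::sum);
                }
            }
        }
        return pairCounts;
    }

    public static void main(String[] args) {
        // Each set holds the files touched by one commit.
        List<Set<String>> commits = List.of(
                Set.of("X.java", "Y.java"),
                Set.of("X.java", "Y.java", "Z.java"),
                Set.of("Z.java"));
        coChanges(commits).forEach((pair, n) ->
                System.out.println(pair + " changed together " + n + " times"));
    }
}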

The MSR community lacks a common technology that allows it to apply all developed techniques uniformly [1]. Instead, individual teams seem to build their own proprietary systems that are then only applicable to a specific MSR technique. Aside from apparent reasons like concrete repository software or dependencies on specific programming languages (issue 1: abstractions), this is mainly caused by the resource intensiveness of MSR. Therefore, only very specialized systems can provide the needed performance (issue 2: scalability).

Software metrics are used to measure certain properties of software (e.g. size, complexity) to assess costs (e.g. to maintain or develop software). Similar to MSR, metrics are language-dependent (issue 1: abstractions), and calculating metrics over the evolution of software (or over many software projects) is computationally expensive (issue 2: scalability).

The presented issues make it hard to apply MSR to large-scale software repositories (repositories with 100-1000 projects, e.g. Apache, Eclipse). But I believe that, if these issues are overcome, MSR can be applied in a larger context, where many projects are analysed to learn something about software engineering itself. Traditional software metrics and their evolution over the revision history, as well as new metrics that include implicit dependency information, can be used to empirically analyse (1) engineering methodologies, (2) programming languages, or (3) API design (patterns).

Approach: A Framework for the Analysis of Large-Scale Software Repositories

Programming APIs for source code repositories, reverse engineering of source code into models (i.e. AST-level models of code), and frameworks for persisting large models allow us to examine a software repository as meta-model-based data (e.g. an EMF model). Our tool srcrepo [6,7] already does this. It uses jGit, MoDisco, and emf-fragments [10] to create AST-level models of the revision histories in git repositories of eclipse projects. Due to its (meta-)model-based nature, this framework could be extended to other languages and source code repositories. This abstraction can solve issue 1.
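As a rough sketch of the extraction front end, jGit’s log API can be used to walk all revisions of a repository (the repository path is a placeholder, and the MoDisco step is only indicated in a comment):

import java.io.File;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.revwalk.RevCommit;

public class RevisionWalk {
    public static void main(String[] args) throws Exception {
        // Open an existing local clone (path is a placeholder).
        try (Git git = Git.open(new File("/path/to/repository"))) {
            for (RevCommit commit : git.log().call()) {
                System.out.printf("%.7s %s: %s%n",
                        commit.getName(),                  // abbreviated SHA-1
                        commit.getAuthorIdent().getName(), // author
                        commit.getShortMessage());         // first message line
                // srcrepo would check out each revision here and run MoDisco's
                // Java discoverer to obtain an AST-level EMF model of it.
            }
        }
    }
}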

For a metrics-based analysis of such source code models, we need techniques to effectively describe and execute aggregation queries. To navigate the extracted data effectively, all queries need to be managed, and all accumulated data has to be associated with its source. The (meta-)modeling community has a large variety of appropriate model transformation and query technologies in store.
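To illustrate what such an aggregation query boils down to, here is a minimal sketch in plain EMF that counts method declarations in an extracted source code model (the model URI is a placeholder, and matching the MoDisco meta-class name MethodDeclaration reflectively is a simplifying assumption; the metamodel’s EPackage must also be registered). An OCL-like query language would express the same traversal declaratively:

import java.util.Iterator;
import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.emf.ecore.resource.impl.ResourceSetImpl;
import org.eclipse.emf.ecore.xmi.impl.XMIResourceFactoryImpl;

public class MethodCounter {
    public static void main(String[] args) {
        // Standalone EMF needs a resource factory for *.xmi files.
        Resource.Factory.Registry.INSTANCE.getExtensionToFactoryMap()
                .put("xmi", new XMIResourceFactoryImpl());

        // Load a previously extracted source code model (URI is a placeholder).
        Resource resource = new ResourceSetImpl()
                .getResource(URI.createFileURI("/path/to/model.xmi"), true);

        // Aggregate over all model objects with EMF's reflective API.
        int methods = 0;
        for (Iterator<EObject> it = resource.getAllContents(); it.hasNext();) {
            if ("MethodDeclaration".equals(it.next().eClass().getName())) {
                methods++;
            }
        }
        System.out.println("method declarations: " + methods);
    }
}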

Applying MSR to a large number of source code repositories requires a lot of computation time. The rationale is that model persistence techniques and query languages can be identified or developed that allow us to execute MSR on large computing clusters governed by modern cloud-computing frameworks (e.g. Hadoop). emf-fragments [10] already uses Hadoop’s HBase to persist models in manageable chunks (fragments). It is reasonable to assume that we can tailor an OCL-like language to execute on these fragments in a map/reduce fashion. This would solve issue 2.
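The following sketch shows the shape of such a job with plain Hadoop map/reduce. The simplifying assumption here is that each input value carries one serialized fragment as text; a real srcrepo job would read fragments from HBase and deserialize them into EMF objects. The mapper emits a partial method count per fragment, and the reducer sums the counts:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: called once per fragment, emits that fragment's method count.
class FragmentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text KEY = new Text("methods");

    @Override
    protected void map(LongWritable offset, Text fragment, Context context)
            throws IOException, InterruptedException {
        String text = fragment.toString();
        int count = 0;
        for (int i = text.indexOf("MethodDeclaration"); i >= 0;
                i = text.indexOf("MethodDeclaration", i + 1)) {
            count++;
        }
        context.write(KEY, new IntWritable(count));
    }
}

// Reduce phase: sums all partial counts into one repository-wide total.
class FragmentReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(key, new IntWritable(total));
    }
}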

First Case Studies

Our framework srcrepo already allows us to create EMF models from git repositories containing eclipse (Java) projects. The eclipse source repositories (git.eclipse.org) provide over 300 such repositories, containing software projects of varying sizes, including eclipse itself.

To verify the conceptual soundness of a “model-based MSR”, we can apply existing MSR algorithms and techniques. Candidates are:

  1. [8] Here, implicit dependencies are used to identify cross-cutting concerns in a software repository. Measurements on many repositories could be used to reason about the effectiveness of AOP or refactoring techniques.
  2. [9] Here, the evolution of modularity in large code bases is analysed using Design Structure Matrices (DSM). The researchers try to estimate the impact of refactoring efforts on the cohesion of modules.

Interesting research tracks

Metrics for revision histories

We have metrics for software code and software models, and there are also fundamental metrics for software repositories. But there are no metrics that combine both. In particular, there are no metrics that involve the implicit dependencies hidden within source code repositories. Furthermore, with these dependencies, metrics become uncertain: they represent statistical processes and not exact numbers.
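One candidate for such an uncertain, history-based metric (a sketch under my own assumptions, not an established definition) is the confidence of an implicit dependency between two code units X and Y, estimated from co-change counts in the revision history:

    conf(X -> Y) = |{ c in Commits : c changes both X and Y }|
                   / |{ c in Commits : c changes X }|

Such a value estimates a statistical process rather than an exact structural property, which matches the uncertainty noted above.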

Comparing languages and methodologies

Language evangelists have fought for decades over which language is the “best” and which development process is the most efficient. MSR allows us to model development efforts precisely and, more importantly, promises to find the sources of avoidable costs or to estimate the impact of certain tasks (e.g. refactoring). To correlate certain properties with the programming languages or methodologies used, we need a large base of different (open source) software projects, and the techniques used need to scale accordingly.

Bibliography

  1. Ahmed E. Hassan: The Road Ahead for Mining Software Repositories, 2008
  2. Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, Andreas Zeller: Mining Version Histories to Guide Software Changes, 2005
  3. Nachiappan Nagappan, Thomas Ball, Andreas Zeller: Mining Metrics to Predict Component Failures, 2006
  4. Sunghun Kim, E. James Whitehead, Jr., Yi Zhang: Classifying Software Changes: Clean or Buggy?, 2008
  5. Chadd C. Williams, Jeffrey K. Hollingsworth: Automatic Mining of Source Code Repositories to Improve Bug Finding Techniques, 2005
  6. Markus Scheidgen: Reference Representation Techniques for Large Models; BigMDE 2013
  7. http://github.com/markus1978/srcrepo
  8. Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns, 2006
  9. Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software Designs: An Empirical Study of Open Source and Proprietary Code, 2005
  10. Markus Scheidgen, Anatolij Zubow, Joachim Fischer, Thomas H. Kolbe: Automated and Transparent Model Fragmentation for Persisting Large Models; MODELS 2012, Innsbruck

June 24, 2013
by Markus
0 comments

Refactorings in Language Development with Asymmetric Bidirectional Model Transformations

Abstract—Software language descriptions comprise several heterogeneous interdependent artifacts that cover different aspects of languages (abstract syntax, notation and semantics). The dependencies between those artifacts demand the simultaneous adaptation of all artifacts when the language is changed. Changes to a language that do not change semantics are referred to as refactorings. This class of changes can be handled automatically by applying predefined types of refactorings. Refactorings are therefore considered a valuable tool for evolving a language.

We present a model transformation based approach for the refactoring of software language descriptions. We use asymmetric bidirectional model transformations to synchronize the various artifacts of language descriptions with a refactoring model that contains all elements that are changed in a particular refactoring. This allows for automatic, type-safe refactorings that also include the language tooling. We apply this approach to an Ecore-, Xtext-, and Xtend-based language description and describe the implementation of a non-trivial refactoring.

Keywords—DSL evolution, language description, refactoring, bidirectional model transformations

Download

@inproceedings{DBLP:conf/sdl/SchmidtWSFK13,
  author    = {Martin Schmidt and
               Arif Wider and
               Markus Scheidgen and
               Joachim Fischer and
               Sebastian von Klinski},
  title     = {Refactorings in Language Development with Asymmetric Bidirectional
               Model Transformations},
  booktitle = {SDL Forum},
  year      = {2013},
  pages     = {222-238},
  ee        = {http://dx.doi.org/10.1007/978-3-642-38911-5_13},
  crossref  = {DBLP:conf/sdl/2013},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
@proceedings{DBLP:conf/sdl/2013,
  editor    = {Ferhat Khendek and
               Maria Toeroe and
               Abdelouahed Gherbi and
               Rick Reed},
  title     = {SDL 2013: Model-Driven Dependability Engineering - 16th
               International SDL Forum, Montreal, Canada, June 26-28, 2013.
               Proceedings},
  booktitle = {SDL Forum},
  publisher = {Springer},
  series    = {Lecture Notes in Computer Science},
  volume    = {7916},
  year      = {2013},
  isbn      = {978-3-642-38910-8},
  ee        = {http://dx.doi.org/10.1007/978-3-642-38911-5},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

June 24, 2013
by Markus
0 comments

EMF Modeling in Traffic Surveillance Experiments

Abstract—We use a wireless sensor network equipped with acceleration sensors to measure seismic waves caused by rolling traffic. In this paper, we report on our experiences in applying an EMF-based data infrastructure to these experiments. We built an experimentation infrastructure that replaces unstructured, text-file-based management of data with a model-based approach. We use EMF to represent sensor data and corresponding analysis results; we use an extension of EMF’s resource API to persist data in a database; and we use model transformations to describe data analysis. We conclude that a model-based approach leads to safer, better documented, and more reproducible experiments.

Keywords—Traffic Surveillance, Wireless Sensor Networks, EMF, Smart City

Download

@inproceedings{Scheidgen:2012:EMT:2491617.2491622,
 author = {Scheidgen, Markus and Zubow, Anatolij},
 title = {EMF modeling in traffic surveillance experiments},
 booktitle = {Proceedings of the Modelling of the Physical World Workshop},
 series = {MOTPW '12},
 year = {2012},
 isbn = {978-1-4503-1808-2},
 location = {Innsbruck, Austria},
 pages = {5:1--5:6},
 articleno = {5},
 numpages = {6},
 url = {http://doi.acm.org/10.1145/2491617.2491622},
 doi = {10.1145/2491617.2491622},
 acmid = {2491622},
 publisher = {ACM},
 address = {New York, NY, USA},
}

March 22, 2013
by Markus
0 comments

MAC Diversity in IEEE 802.11n MIMO Networks

Abstract—Opportunistic Routing (OR) is a novel routing technique for wireless mesh networks that exploits the broadcast nature of the wireless medium. OR combines frames from multiple receivers and therefore creates a form of Spatial Diversity, called MAC Diversity [1]. The gain from OR is especially high in networks where the majority of links have a high packet loss probability. The updated IEEE 802.11n standard improves the physical layer with the ability to use multiple transmit and receive antennas, i.e. Multiple-Input and Multiple-Output (MIMO), and therefore already offers spatial diversity on the physical layer, called Physical Diversity, which improves the reliability of a wireless link by reducing its error rate. In this paper we quantify the gain from MAC diversity as utilized by OR in the presence of PHY diversity as provided by a MIMO system like 802.11n. We experimented with an IEEE 802.11n indoor testbed and analyzed the nature of packet losses. Our experimental results show negligible MAC diversity gains for both interference-prone 2.4 GHz and interference-free 5 GHz channels when using 802.11n. This is different from the observations made with single antenna systems based on 802.11b/g [1], as well as from initial studies with 802.11n [2].

Keywords—IEEE 802.11n, MAC Diversity, Opportunistic Routing, PHY Diversity, Research, Testbed, Wireless Networks

@inproceedings{DBLP:conf/wd/ZubowSS12,
  author    = {Anatolij Zubow and
               Robert Sombrutzki and
               Markus Scheidgen},
  title     = {MAC diversity in IEEE 802.11n MIMO networks},
  booktitle = {Wireless Days},
  year      = {2012},
  pages     = {1-8},
  ee        = {http://dx.doi.org/10.1109/WD.2012.6402802},
  crossref  = {DBLP:conf/wd/2012},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
@proceedings{DBLP:conf/wd/2012,
  title     = {Proceedings of the IFIP Wireless Days Conference 2012, Dublin,
               Ireland, November 21-23, 2012},
  booktitle = {Wireless Days},
  publisher = {IEEE},
  year      = {2012},
  isbn      = {978-1-4673-4402-9},
  ee        = {http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6387977},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

March 22, 2013
by Markus
0 comments

Map/Reduce on EMF Models

Abstract—Map/Reduce is the programming model of cloud computing. It enables the processing of data sets of unprecedented size, but it also delegates the handling of complex data structures completely to its users. In this paper, we apply Map/Reduce to EMF-based models to cope with complex data structures in the familiar, easy-to-use, and type-safe EMF fashion, combining the advantages of both technologies. We use our framework EMF-Fragments to store very large EMF models in distributed key-value stores (Hadoop’s HBase). This allows us to build Map/Reduce programs that use EMF’s generated APIs to process those very large EMF models. We present our framework and two example Map/Reduce jobs for querying software models and for analyzing sensor data represented as EMF models.

Keywords—EMF, big data, cloud computing, map/reduce, meta-modeling

@inproceedings{Scheidgen:2012:MEM:2446224.2446231,
 author = {Scheidgen, Markus and Zubow, Anatolij},
 title = {Map/reduce on EMF models},
 booktitle = {Proceedings of the 1st International Workshop on Model-Driven Engineering for High Performance and CLoud computing},
 series = {MDHPCL '12},
 year = {2012},
 isbn = {978-1-4503-1810-5},
 location = {Innsbruck, Austria},
 pages = {7:1--7:5},
 articleno = {7},
 numpages = {5},
 url = {http://doi.acm.org/10.1145/2446224.2446231},
 doi = {10.1145/2446224.2446231},
 acmid = {2446231},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {EMF, big data, cloud computing, map/reduce, meta-modeling},
}

March 22, 2013
by Markus
0 comments

HWL — A high performance wireless sensor research network

Abstract—Current Wireless Sensor Networks (WSN) consist of low-powered, energy-efficient nodes with short-range radios and limited computation capabilities. Sensing applications with multiple sensors, high sample rates, and large spatial coverage are not realizable with conventional WSNs. We propose High Performance WSNs (HP-WSN) with an 802.11n-based physical layer and opportunistic routing to overcome the limitations of existing WSNs. We present a research test-bed for such HP-WSNs. As an intermediate step towards actual HP-WSNs, our test-bed (HWL) collects raw data into a centralized data store to provide an experiment-friendly environment. The collected raw data includes sensor data (to develop new sensing applications) and data about network and system operation (to develop the sensor network). We provide an example HP-WSN application, derive research objectives for the development of HP-WSNs, present a test-bed architecture, and present evaluation results on network and data storage performance to show the general feasibility of HP-WSNs.

Keywords—Databases; IEEE 802.11n Standard; Routing Protocols; Wireless Sensor Networks; Test-Bed

@INPROCEEDINGS{6240552, 
author={Scheidgen, M. and Zubow, A. and Sombrutzki, R.}, 
booktitle={Networked Sensing Systems (INSS), 2012 Ninth International Conference on}, 
title={HWL — A high performance wireless sensor research network}, 
year={2012}, 
month={June}, 
pages={1-4}, 
keywords={Databases;IEEE 802.11n Standard;Routing;Routing protocols;Sensors;Wireless communication;Wireless sensor networks;IEEE 802.11n;Opportunistic Routing;Research Testbed;Wireless Sensor Networks}, 
doi={10.1109/INSS.2012.6240552}
}

March 22, 2013
by Markus
0 comments

Towards Smart Berlin – An Experimental Facility for Heterogeneous Smart City Infrastructures

Abstract—In this paper, we present the Smart Berlin Testbed as an infrastructure for experimental research on Smart City scenarios. As part of Smart Cities, applications will arise that build upon a variety of information sources and provide the user with near real-time information about the surrounding environment. An important role in these scenarios will be taken by wireless network infrastructures. They will function as interfaces to the users, who connect with their smartphone or laptop to access the applications. Additionally, wireless and wired sensor networks are required to monitor the urban environment. The Smart City infrastructure needs to integrate these sensor networks and make them accessible and controllable for the applications. The set-up of the Smart Berlin Testbed is accomplished with the interconnection of two large wireless mesh and sensor networks in Berlin, namely the DES-Testbed at the Freie Universität Berlin and the HWL-Testbed at the Humboldt University Berlin. Together, both networks comprise 250 wireless multi-radio mesh routers and a number of heterogeneous sensor nodes of the same order. We describe how we interconnect the testbeds via the Internet and the research possibilities provided by the diverse network architectures. As a first experiment, we show results of a white space detection experiment that was carried out in the Smart Berlin Testbed in order to assess the channel conditions at the testbed sites.

Keywords—smart city, wireless sensor networks, wireless mesh networks

@inproceedings{DBLP:conf/lcn/JuraschekZHSBSGF12,
  author    = {Felix Juraschek and
               Anatolij Zubow and
               Oliver Hahm and
               Markus Scheidgen and
               Bastian Blywis and
               Robert Sombrutzki and
               Mesut G{\"u}nes and
               Joachim Fischer},
  title     = {Towards Smart Berlin - an experimental facility for heterogeneous
               Smart City infrastructures},
  booktitle = {LCN Workshops},
  year      = {2012},
  pages     = {886-892},
  ee        = {http://dx.doi.org/10.1109/LCNW.2012.6424078},
  crossref  = {DBLP:conf/lcn/2012w},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
@proceedings{DBLP:conf/lcn/2012w,
  title     = {37th Annual IEEE Conference on Local Computer Networks,
               Workshop Proceedings, Clearwater Beach, FL, USA, October
               22-25, 2012},
  booktitle = {LCN Workshops},
  publisher = {IEEE},
  year      = {2012},
  isbn      = {978-1-4673-2130-3},
  ee        = {http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6415465},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}