[FGED-discuss] Notes from the morning of the Community Based Standards workshop (part 1)
stoeckrt at upenn.edu
Wed Feb 25 18:12:11 UTC 2015
Workshop on Community-Based Data and Metadata Standards Development
February 25-26, 2015
Phil Bourne starts us off with a sense of how standards development is an essential part of our data landscape. He described his own experience developing standards for 3D protein structure, and how this standards work fits into the broader BD2K vision of the digital ecosystem. Standards are important for this vision, including the Commons and the Data Discovery Index. He emphasized the importance of taking into consideration all the other standards work that is going on in various communities, including international efforts. RFI Standards Framework, Nov 2014 - informed by this workshop
ESIP is the Earth Science Information Partners Federation. See: http://esipfed.org/
We can do things better together than we can separately
ESIP has both data and technology practitioners intermingled
single investigator project teams, to very large collaborative projects
ESIP is a network which includes other networks
Types of participants:
• Data centers (NASA DAACs), NOAA,
• Researchers and tool developers
• Application developers
• Strategic partners (NASA)
Clusters, which are agile rapid response teams. Some examples are:
ESIP provides community coordination to support interoperability at the data, systems, human and organizational levels. It is not a standards body but rather about best practices. Community coordination and information sharing were emphasized as important to success.
Examples of clusters and their results were described.
Data spaces are like a Wikipedia for data.
Citation and identifiers are an example of an active cluster
2015 Dynamic data citation workshop
2015 Documentation Cluster - Encoding metadata groups/attributes in HDF (Hierarchical Data Format) and netCDF (network Common Data Format), common in earth sciences community.
2015 Semantic Web Cluster - Branding of Bioportal that runs on the cloud. ESIP is the governing body for vocab and ontology in the earth science community.
Workshop report: “Planning for a Community study of Scientific Data Infrastructure”
ToolMatch is an effort to link tools and data in a larger context.
Defining impact: Commitment of cross-sector actors to come together around a shared agenda and change their behaviour. ESIP is a backbone organization to facilitate this.
Standards activities are a technical activity. However, adoption is a social activity. As a “backbone organization”, ESIP brings both together from the start.
ESIP is agile, fast, flexible, collaborative, not competitive, viewed as a neutral organization that has shared agendas, not hidden agendas.
New technologies like Docker, Schema.org are being assessed. (Docker is a technology for containerization of software that abstracts away dependencies. It provides lightweight virtualization.)
Data button on your browser will be based on dataset indexing via DCAT schema.org extension. Benefit of using these standards for describing databases/ data resources.
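As a rough illustration of what a schema.org style dataset description could look like, here is a minimal Python sketch that builds a JSON-LD record. The type and property names (Dataset, DataDownload, name, description, url, keywords, distribution, contentUrl, encodingFormat) come from the schema.org vocabulary; the dataset name, URLs and keywords are invented for illustration, and this is not a description of how ESIP actually implements its indexing.

```python
import json


def make_dataset_record(name, description, landing_url, download_url,
                        media_type, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": landing_url,
        "keywords": list(keywords),
        # One downloadable file for the dataset; more could be appended.
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": media_type,
            }
        ],
    }


# Hypothetical example values, not a real dataset.
record = make_dataset_record(
    name="Example sea surface temperature grid",
    description="Illustrative record only.",
    landing_url="https://example.org/datasets/sst",
    download_url="https://example.org/datasets/sst/sst.nc",
    media_type="application/x-netcdf",
    keywords=["sea surface temperature", "netCDF"],
)

print(json.dumps(record, indent=2))
```

A record like this can be embedded in a dataset landing page so that an indexer can discover the dataset and its download link without custom scraping.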
ESIP is like a bus and you can get on and off as you need. There is a low barrier to entry.
Use cases are very important.
The “stack” of hardware, software, human infrastructure.
2015 ESIP meeting at Asilomar July 14-17
Q1. Has ESIP made progress on identifiers?
• Yes, and there are resources to help understand the properties of identifiers that are most suitable in various scenarios.
Q2: Phil: what hasn’t worked? what is the role of the funder to facilitate success?
• Some efforts were too programmed from the start with a particular agenda and did not really engage participants.
ecosystems take time, can’t engineer them, they have to form on their own accord
could fund some people to show up, and then see what happens, and then let go
strongly resisted line items of things they want done, have responded very well
Session One: Community-driven Processes for Developing Data and Metadata standards in the Basic Sciences and the Potential Impact(s) for Advancing NIH-relevant Research
“The nice thing about standards is that you have so many to choose from”
66 in bioportal / biosharing. some are obsolete, overlap
• no coordination in the early development of zebrafish, mouse, anatomy ontologies even though they share some set of anatomical features
• lots of pushback on developing uberon because it didn’t meet specific criteria from different standards perspectives, but there was real need to make all the anatomies interoperable
Uberon ontology tried to align the various other efforts for interoperability. There was initial skepticism.
Goal was one simple semantic framework
2012 brought all efforts together.
contributing to a central resource is highly challenging
Both technical and social as well as financial challenges.
He was part of OMG (Object Management Group). This group was led by vendors and was creator of UML (Unified Modeling Language).
Lessons learned from failure of OMG LSR
The FAQ cited use of CORBA as a reason for success but this standard also failed.
Too complex and based on technical aspects
Waterfall model did not work (lots of sequential steps but too easy to become obsolete and not be adopted). This model is standard software engineering and just does not apply well.
21st century knowledge workers revolution! This uses a different philosophy:
Software: Agile, eXtreme, JSON, IDEs.
Standards: Bottom up, community based, open ontologies
GitHub is a new paradigm for software development
-Distributed version control
-Graph based audit trail of every contribution
Does GitHub make sense for development for community standards? He thinks so. “We need a GitHub++ for vocabulary and ontology development. GitHub was designed for managing chunks of ASCII text. For a big complex interlocking thing like an ontology, we need extensions.”
He cites example of GA4GH (Global Alliance for Genomics and Health)
“We need GitHub++ for vocabulary and ontology development”
How do you balance peer development with coherent vision? One approach is to make a map of the overall territory
We need rewards for creating standards that are more widely recognized
Future proofing is a challenge
Q1: There are hundreds of ontologies. How do we use these effectively given their extensive overlap?
• There need to be better mechanisms to help people work together. There is both a social and technical need.
Three experiences shared as illustrative examples.
2001 MIAME (Minimal Information about a Microarray Experiment) created by MGED. But this is just a checklist and does not address syntax and semantics. What worked is that it was simple and got wide buy in. However, only ~50% compliance. Journals do not enforce the requirement even when stated.
mid-2000s Ontologies work, OBO Foundry
“It is one thing to share data but quite another to share with metadata that is systematically processed.” Lots of buy in but too much dependence on volunteer time.
2014 NIAID GSCID/BRC project and Sample Application Standard
This is an application standard that takes advantage of existing standards. It has the advantage of base funding. Now trying to extend to clinical data.
Discussion of evolution of BioPAX and COMBINE which includes a number of groups coming together. But it is hard to get people to come together without incentives.
****Need to come back to the Idea that here we are looking for community or collective efforts that can have tangible incentives (for tenure, publications, grants, etc)
FCS- started in 1984 and has gone on for 31 years but this was all an unfunded mandate. The idea of incentives for these efforts is a recurring theme throughout this panel.
Issue: clinical metadata -- tooling to reuse the standards and get things to match up.
Issue: Data standards are laborious and can be painful to apply. We need to show some clear outcome or benefit. If we require people to use standards, do it in a clear way -- require data to be submitted in a defined way, and make tools available to reviewers so they can check compliance.
Importance of a validator that enforces the metadata standards.
Validators show what is wrong so the data can be corrected. They provide key information to editors and reviewers. Standards should be distributed with these validators.
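To make the validator idea concrete, here is a minimal Python sketch of a checker that reports what is wrong with a metadata record so it can be corrected. The field names and the controlled vocabulary are hypothetical and do not come from any of the standards discussed in the workshop; a real validator would load its rules from the standard itself.

```python
# Hypothetical checklist: these field names and allowed values are
# assumptions for illustration, not part of MIAME or any other standard.
REQUIRED_FIELDS = ("sample_id", "organism", "collection_date")
CONTROLLED_VALUES = {
    "organism": {"Homo sapiens", "Mus musculus", "Danio rerio"},
}


def validate_record(record):
    """Return a list of readable problems; an empty list means compliant."""
    problems = []
    # Rule 1: every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field, "")
        if not str(value).strip():
            problems.append(f"missing required field: {field}")
    # Rule 2: fields with a controlled vocabulary must use an allowed term.
    for field, allowed in CONTROLLED_VALUES.items():
        value = record.get(field)
        if value and value not in allowed:
            problems.append(
                f"{field}: '{value}' is not in the controlled vocabulary"
            )
    return problems


good = {
    "sample_id": "S1",
    "organism": "Mus musculus",
    "collection_date": "2015-02-25",
}
bad = {"sample_id": "S2", "organism": "mouse"}

for name, rec in (("good", good), ("bad", bad)):
    issues = validate_record(rec)
    print(name, "OK" if not issues else issues)
```

Returning the full list of problems, rather than stopping at the first one, is what lets a submitter fix everything in one pass and gives editors and reviewers a compliance report they can act on.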
We have a lot of overlapping standards. Can’t have 10 different standards where there could be one. Standards have to follow certain principles and be quantifiably testable. Therefore, there needs to be an educational component, for example so people understand why a data dictionary is not a standard. Coordinate/ certify/ outreach to make and use standards/ ways to maintain and distribute standards. One example is that with MIAME they created MAGE-TAB and MAGE-ML to help implement it.
Michel: minimal standards talks to agile, but is it sufficient, and how are they evaluated with respect to BD2K community goals e.g. data discovery, data reuse, etc?
Melissa: As we go on, we should consider how we know whether a standard is useful? There have not been a lot of studies in that area.
Gary Bader - Pathway standards
PSI-MI (HUPO Proteomics Standards) This came out of mass spectrometry community
BioPAX (Biological Pathways Exchange) was funded by DOE and NIH and came out of database communities
Challenges: early OWL adopter which needed significant work, now more, more formats, how to structure controlled vocabularies etc. and people want to build on others’ ontologies, but it’s still not that easy. Still need for development tools.
There was fracture between theoretical computer scientists and database groups.
Challenge: social issues in the community with different goals/ approaches.
Challenge: volunteer energy is variable. (People do have day jobs/ lack of funding/ people leave).
COMBINE (Computational Modelling in Biology Network) came out of systems biology community and people who developed SBML. This is an effort to combine standards so that people can focus on data modeling.
“Being rewarded by papers tend to fragment community because you are rewarded for doing something new”
Q1: What did you mean by the challenge of nonexperts interfering? How is this not community engagement?
• It was a nonbiologist manager who wanted to determine how protein interactions were represented but this was a special interest of a particular database.
Theme: Community efforts valued, as opposed to valuing expert, single PI papers/efforts
Ryan Brinkman - Standards in flow cytometry
He described FCS experience over 31 years
Long development cycles because of variable volunteer effort.
How to measure success? -- people using it… 7 standards done for data and metadata, e.g. FCS standard (1984 through multiple revisions), … how to incentivize people to do it? Funding an issue. Just writing a standard, about 100K to pay a writer to write the document. They’ve done 7 times - 800K. Long time lines.
2006 Start of standards development effort
2015 PLOS trying to address low compliance
“Loud experts” where everyone wants their say.
Volunteers make promises but do not deliver.
Key success factors
• Simple scope
• Very simple format (.csv with a few extra rules and additional semantics)
Funding started with initial NIH support but then was all very ad hoc.
“Standards don’t come free”- need grant support mechanism to develop standards
One problem is that Standards don’t die. How do we move away from technologies for standards development that no longer work?
Q1 (Michel). Rapid pace of change with new technologies, standards, and data types. COMBINE is interesting but at what point do things need to “die”? At what point are we done with something and need to move on?
• Standards do not die and that is part of the problem. There are standards that are very poor but have legacy users who are proponents.
• Legacy standards can be dealt with by transforms to the new. However hard to deal with XML standard if developing a JSON standard, for example, so we have to get better at this.
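A transform from a legacy format to a newer one can be sketched in a few lines. The example below is only an illustration, using Python's standard library: the XML record layout (a sample element with an id attribute and flat child elements) is invented here and does not correspond to any real standard mentioned in the discussion.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical legacy record; the element and attribute names are assumptions.
LEGACY_XML = """<sample id="S1">
  <organism>Mus musculus</organism>
  <tissue>liver</tissue>
</sample>"""


def legacy_xml_to_json(xml_text):
    """Convert one flat legacy XML record into a JSON object string."""
    root = ET.fromstring(xml_text)
    # Carry the identifier attribute over as an ordinary key.
    record = {"id": root.get("id")}
    # Each child element becomes a key with its trimmed text as the value.
    for child in root:
        record[child.tag] = (child.text or "").strip()
    return json.dumps(record, sort_keys=True)


converted = legacy_xml_to_json(LEGACY_XML)
print(converted)
```

Flat records like this one map cleanly. The hard cases raised in the discussion are the ones this sketch ignores: nested elements, repeated elements, mixed attributes and text, and namespaces all need explicit mapping decisions, which is where most of the transform effort goes.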
**Standards must be a paid activity.
How do you know when people are using it and to what extent? e.g. of BRO. Still in Bioportal. What are the downstream effects of saying obsolete and taking it down. Hard to put the stake in the heart of a standard.
MH: To evaluate standards, they have to be linked/related in some way to the data that uses them. This is also critical for knowing when a standard should die.
Anita?: Should be a clear owner. Likes Jessica’s plan of somebody gets paid to do this. Together with life cycle, owner would make a big difference.
Standards life cycle much longer than a grant cycle.
How do we make it easier for people to identify such initiatives, so that you don’t have to later try to put together multiple nascent standards?
Downside of ownership -- to … but maybe need code of conduct.
Ownership is good, but needs to be defined e.g. owner could set up the standard process and support it, but initiate a community process to develop the standard without interfering, but stepping up with leadership when needed e.g. grant applications, step in when things go wrong with the community process (but be very careful with this and hopefully apply it rarely, needs to be in line with community goals). This owner role could be defined (code of conduct) to support/facilitate standards development, not hinder it.