[FGED-discuss] Notes from the Community Based Standards workshop Day 2

Chris Stoeckert stoeckrt at upenn.edu
Thu Feb 26 19:18:45 UTC 2015


Day 2
Session 6: Tools and Rules: Technical Hurdles

James Overton--user-focused tools
Standards are about communicating clearly and thinking clearly
Everyone has their specialization… not everyone can see all the complexity and layers. Give people good tools that solve problems. Help people think more clearly. It is important not to make a tool that people have to fight.
A small number of tools with the right level of abstraction--community-wide tools. Let scientists do the work they need to do. Ergonomics--good human interaction; good error codes; work efficiently. Need to be able to maintain long-term.
Community-wide tools, not project-specific tools, allow non-specialists/non-programmers to engage with data/semantics.
Tool development also does not fit the grant cycle well. Even open-source tools need experts and continuity of development; making a tool open source does not, by itself, solve the problem of community maintenance and support.

Tom Oniki
Adoption of standards and implementations--needs tools.
You either need someone who can consult or you need smart tools.
Can we build tools that isolate the need for knowledge?
New view for exception handling--learning framework (logging, response, application)
It should not be too problematic to implement models that allow for extensions and a framework for "exceptions". Maybe NIH also needs to fund projects to build tools and help people adopt standards.

Anand Basu - (not present, but slide presented by Eric Neumann)
	• Registry/Yellow Pages for Standards
		• Where do I go to find out which standard I should use in data collection? E.g., should I use SNOMED CT, LOINC, or ICD-10?
		• Metadata repositories -- which to use? caDSR, USHIK (and its website), PHIN VADS?
		• Can we have open-source, easy-to-use vocabulary/terminology mapping tools so that data collected in one standard can be automatically converted into another? (See the sketch below.)
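
A minimal sketch of what such a mapping tool could look like, assuming a simple lookup table between two code systems; the table entries and function below are illustrative only, not authoritative mappings or any existing tool's API:

    # Illustrative sketch: translate codes from one terminology to another
    # using a hand-curated mapping table. A real tool would draw on curated
    # crosswalk resources rather than a hard-coded dictionary.

    # Hypothetical ICD-10 -> SNOMED CT mapping table (example entries only).
    ICD10_TO_SNOMED = {
        "E11": "44054006",   # Type 2 diabetes mellitus
        "I10": "38341003",   # Essential (primary) hypertension
    }

    def translate(code, mapping=ICD10_TO_SNOMED):
        """Return the mapped code, or None if no mapping is known."""
        return mapping.get(code)

    for icd10 in ["E11", "I10", "Z99"]:
        print(icd10, "->", translate(icd10))

The hard part in practice is building and maintaining the mapping table itself, which is why the request is for shared, open-source tooling rather than per-project scripts.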

Mike Feolo - Observations from dbGaP -
Phenotypes are rarely standardized in dbGaP (only the eMERGE study has linked back to a controlled vocabulary).
Mapping to PhenX but only in eMERGE
	• requires human curation and is limited by the granularity of the vocabulary
How dbGaP addresses this:
	• Allows each analyst to harmonize data at the desired granularity
	• However, this forces harmonization
Need to have better curation upstream before deposition
Post hoc annotation is difficult; it is best to do the annotation up front, putting it into the data plan.
dbGaP uses the de facto standard BAM file but processes it into an SRA file that is smaller and has location information (SRA is much more compact).
However, prior processing of VCF files is not captured in a standard way, though you can put it in the header; that metadata is neither standardized nor always supplied. This has emerged as an issue.
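As a concrete illustration (the tool name and command line below are hypothetical), provenance can be squeezed into free-form VCF header lines, but there is no agreed vocabulary or requirement for them:

    ##fileformat=VCFv4.2
    ##reference=GRCh38
    ##source=callerX-1.2
    ##callerX_command=callerX --min-depth 10 sample.bam

Because each tool invents its own header keys (or omits them), downstream users cannot reliably reconstruct how a given VCF was produced.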

Ryan Brinkman
Writing technical documents on data interchange standards is not as difficult to do; you don't need tools for that.

Discussion--Session 6
dbGaP discussion - around annotating with terminology and the granularity of use -- example of "I take blood pressure in a very specific way." We don't want to pre-annotate, and we are worried about post hoc annotation. But Melissa brought up the point of indexing for search vs. annotating for reuse.

Generalization: need to collaborate to define upstream tools to aid data deposition. TOOL REQUIREMENT

Searching in dbGaP would be easier if submissions came in standardized.
Recommendation: build a better submission tool; collaborate with the community on requirements and testing.

Warren: notes different levels of granularity of specification in ArrayExpress, dbGaP, and GEO -- but he says you need to use SOME kind of code. ArrayExpress makes sure that people submit something; maybe not good enough, but something. Even with things like male/female, there are multiple ways of representing it. It is a balancing act between the burden on submitters and how easy the data are to find and use on the other end. There is room to get somewhere better.

Jessie - these things are not mutually exclusive; we can get scruffy with some high-level metadata and still be specific. So maybe NIH needs to mandate that the simple metadata be included.

Exception handling - we need tools that say what to do when there are problems. It is easy to throw an error, but hard to throw an error with a message that says how to fix it; that is difficult and expensive for curators.
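A toy sketch of the difference, assuming a made-up submission checker for a "sex" field; the field name, allowed values, and suggested fix are illustrative, not any repository's actual rules:

    # Illustrative sketch: an actionable validation error vs. a bare one.
    # Field name and allowed values are hypothetical.

    ALLOWED_SEX_VALUES = {"male", "female", "unknown"}

    def check_sex_field(value):
        if value.strip().lower() not in ALLOWED_SEX_VALUES:
            # Unhelpful version: raise ValueError("invalid value")
            # Actionable version: say what was wrong and how to fix it.
            raise ValueError(
                f"Unrecognized sex value {value!r}; expected one of "
                f"{sorted(ALLOWED_SEX_VALUES)}. For example, map 'M' to 'male' "
                "and 'F' to 'female', then resubmit."
            )

    check_sex_field("M")  # raises with a message that explains the fix

Writing the second kind of message requires domain knowledge, which is exactly the curation cost noted above.
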
Is there a way to make data sets combine or "play together"?
OBO has developed best practices to help do this; they solve some problems at the low level.
Elaine: Clean, real truth in data never exists, and the science changes. How can we build tools that move with the science?
Phil: Should we do challenges for data authentication tools?  That could help bring attention to these issues.

MH recommendation: we need tools to feed back to data providers what has been done with their data and what enhancements to tools have been made; otherwise it's a one-way street.
Great idea from Michel Dumontier - have data owners link various data-improvement efforts back to the original data so they can be leveraged by others.

Melissa: There are social and technical problems. How can we build tools to help -- a modular way of feeding content back and alerting people, instead of filling in the gaps by email?

TCA data example - the data were reconfigured and then made available in dbGaP as a new study data set.

NIH could pay someone to go through GWAS data; the same could be done for phenotypes.

Session 7: Break out Groups

Group 1
Facilitator: Todd Carpenter
Rapporteurs: Sheri Schully/Ravi Ravichandran

People don't get recognized for maintenance and infrastructure building in the current model. NIH needs to fund the boring part - maintenance and developing meta-structure. No one gets credit in their community for making a shared framework/scaffolding.

Rethink how these studies are reviewed at study section so that infrastructure grants are funded. Maybe a stand-alone study section or an RFA that targets infrastructure building.
Not development of new standards but improvement or expansion of existing standards. “Friends don’t let friends develop new standards.”

Will this be led by NIH or the extramural community? Think about what NIH did for PubMed… it created a standardized infrastructure for publications.

Ex: NSF DataNet project
7-8 years ago NSF funded 2 large networks (New Mexico, Hopkins).
Repository nodes (data repositories, metadata structures) for organism biology and astrophysics. These projects were building infrastructure and data repositories. They have also been instrumental in bringing the community together. Self-organized groups work together on a data system as well as on tools to make the data useful.

Targeted Recommendations:
	• Start referring to the field as something else that is more appealing  (ex: “information engineering”)
	• Get new study section or revamp current system so that these grants are more appealing
	• Maybe change grant submission to have a preliminary proposal period before the full proposal
	• Data and metadata guides, and develop metrics to understand what is out there and how to use it (maybe link up with BioSharing to do this?)
	• No one is currently funding infrastructure that will connect areas that are not currently connected - NIH should take the lead on this
	• Funding to enhance, expand, and improve existing standards - people need credit to do this.

Group 2
Facilitator: Ryan Brinkman
Rapporteur: Lesley Skalla
               
Gaps in data and metadata standards development processes.
Areas of greatest need and impact for data and metadata standards development.
                                               
What constitutes due diligence for assessing the existing data standards landscape?
	• Need index of standards (like current index of vocabularies) so users can assess what is available for use.
	• The problem is that standards come in different formats and are located in disparate places. Users do not know where to search for content.
	• Higher level standards content needs to be indexed at least so it can be found.
	• Maybe publishers can have a standards resource issue available for users to utilize.

What factors should be considered in prioritizing needs for development of data and metadata standards?
	• a new technology may spur the need for a new standard.
	• amount of data and presence of existing standards.
	• try to engage with the software vendors developing that technology. How do we incentivize the instrument vendors? It is not their priority, but a standards group needs to get involved.
	• Recommendation- have a standards body the vendors can work with, perhaps facilitated by FDA as they will know about this soonest.
	• grantees need to tell NIH why they cannot use an accepted standard if they are not going to use one; otherwise they must indicate which standards are relevant and which they would use. If none are available, they should state this, and the feedback to NIH should be tracked.

How do you determine the potential or eventual impact of a data standard?
	• Gauge the size of the community that is going to implement the standard - how large and how sustainable that community is (defined as the people who contribute to the effort). The percentage of people who participate in the effort may be a better metric.
	• Need a signatory member of the community to represent the standard. You need to demonstrate that the effort is worthwhile. Could utilize societies to garner support for the standard.
	• Prediction of success - how much support will be available for the standard and how complex the standard is. Who is affected, who is involved, and who will commit resources?
	• Should have commitments from 2 different pilot implementations
	• Sometimes potential is determined by the need for the standard.
	• Identify all possible stakeholders to weigh in on the standard--publishers, vendors, etc.

What specific use cases represent urgent and unmet needs for data and metadata standards?
	• Use case for the CEDAR center of excellence - the annotation of immunological data (ImmPort) is difficult; text data mining is being used now. Take any database where data needs annotation and use it as a use case.
	• There is a lot of heterogeneity in the way people apply annotations. A little bit is better than nothing. The unmet need is templates for common use cases - templates that guide people and provide a limited choice of annotations (see the sketch after this list).
	• Gap in annotation innovation - statistical results are not well documented. Algorithms are run but no documentation is provided.
	• Need: reporting of statistical results and how to bring that into the tools (= reproducible research)
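
A minimal sketch of what a constrained annotation template might look like, assuming hypothetical field names and value lists (not drawn from any particular standard):

    # Illustrative sketch: a template that limits the choice of annotations
    # per field and flags records that stray from it. Fields and allowed
    # values are made up for the example.

    TEMPLATE = {
        "organism":  {"required": True,  "allowed": ["Homo sapiens", "Mus musculus"]},
        "tissue":    {"required": True,  "allowed": ["blood", "liver", "brain"]},
        "assay":     {"required": True,  "allowed": ["RNA-seq", "flow cytometry"]},
        "treatment": {"required": False, "allowed": None},  # free text
    }

    def validate(record):
        """Return a list of problems found in a metadata record."""
        problems = []
        for field, rule in TEMPLATE.items():
            value = record.get(field)
            if value is None:
                if rule["required"]:
                    problems.append(f"missing required field '{field}'")
            elif rule["allowed"] is not None and value not in rule["allowed"]:
                problems.append(f"'{value}' is not an allowed value for '{field}'")
        return problems

    print(validate({"organism": "Homo sapiens", "tissue": "muscle", "assay": "RNA-seq"}))

Even a small template like this removes much of the heterogeneity, because submitters pick from a short list instead of inventing their own terms.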
                  
Second Breakout Group

Group 1:
Facilitator: Eva Huala
Rapporteurs: Yaffa Rubinstein/Sheri Schully

Who are the “stakeholders”?
	• Data Submitters
	• Downstream users
	• Publishers
	• Funding agencies
	• Vendors
	• Tool builders
	• Repositories

What are best practices for identifying relevant stakeholders?
How to build the community? What will the community look like?
Need to connect the various silos into one larger community (cluster) - COMBINE and ESIP are some examples.
Once identified, governance needs to be established

Create a central data exchange where people can come together as a community

What are the incentives for participation of different stakeholders?
Citations for downstream users publishing from data - these have been discussed before, but a new model may be needed.

PPPs (public-private partnerships) [Including government agencies] may be a great model to consider for incentives as well as stakeholder engagement. How do we promote a culture that encourages PPPs? Look at groups like PCORI (and others) to see what makes their public-private partnerships successful. Can NIH facilitate successful PPPs in the Data Standards Space?
	• IP issues can get messy (especially around openness), so they will have to be made clear up front [including analytics]

What creates “value” in the area of standards development? [comment: Need to think about values for ALL stakeholders]
	• Money
	• Special considerations in grant review
	• Citations
	• Savings for companies (related to money)

Standards are about efficiencies. Related to savings: we are all trying to create savings (money not spent) while raising the bar in terms of what we are trying to achieve. How do standards make each stakeholder's activities more efficient? We need to answer this question.

“Public-Private Partnerships for Dummies” should be created

How do governance models support (or discourage) community engagement?
Having enough governance in place is key so that we can encourage communities to want to work together.

How does engagement translate into maintenance and sustainability?
If the community is providing value to the stakeholders, it will be sustained. More community buy-in and engagement will intrinsically provide value.

Group 2:
Facilitator: Michel Dumontier
Rapporteur: Lesley Skalla

What are best practices for identifying relevant stakeholders?
	• develop a broad set of use cases.
	• expand participation to a broad coalition; assume everyone has a vested interest in our science, including the public. Be inclusive of citizen scientists, patient advocacy groups, etc. Coordinate and share information amongst agencies.
	• engage academic societies to inform their membership
	• engage publishers to encourage or mandate users to use this new data standard?
		• STM journals
	• simplify calls for participation: the language used by NIH is difficult for outsiders to use. If we are going to bring more people into NIH-funded projects, the way NIH communicates needs to be made easier to understand.

How do governance models support (or discourage) community engagement?
	• develop and sustain efforts to list and detail standards (e.g. biosharing.org)
	• coordination.
		• early engagement. do we want individual standard developers going to publishers and promoting new standards or should they go through a higher standards body (may want greater backing from wider community)?
		• W3C as a model organization? It is responsive: have an idea, start a working group, draft a recommendation. Maybe schema.org as a model? No consensus-building requirement; it is more fluid.
		• Possible role of NIH to provide a taskforce to mediate, convene interactions of these standards groups so there is alignment between clinical and basic science communities.
		• calls to evaluate the standard.
		• close the gap in the life cycle from data development to data reuse.
		• Which groups should NIH regularly interact with? For example, NSF required vendors to ensure that all data are exported in the same standard format; NIH has not done this yet?
		• Request for some sort of NIH coordinating center that has its hand on the pulse of new technologies and the communities developing standards for those technologies. The center could also have access to data-sharing plans.

How does engagement translate into maintenance and sustainability?
	• support the adoption of a standard! (new tools, etc)
	• encourage innovation to scale standards development.
	• enable the continuous development of a standard.
	• examine the role of public-private partnerships. Can we have industry fund standards? W3C is an example of a private/public partnership. NIH cannot do this alone - other federal agencies need to join.

Desiderata for data standards.

Notes: for RFAs, how can we promote shared efforts across agencies? One community may have tools or have solved problems that another has not.

Patient advocacy groups need to see what is going on and what the outcomes are; they want to be part of these new issues.

	• Study Sections to value data standard development
	• New RFA on Infrastructure and expanding/modifying current standards instead of creating new ones
	• Look more globally and leverage what is already being done
	• Incentives for Data Standards development and maintenance

Think more in terms of a life-cycle approach.

Bring stakeholders in earlier in the process (providing right environment early on)

Look at what has already been developed and expand/modify/etc instead of always developing new standards

High Level Take-Homes
Money talks:
	• $ for evaluation
	• $ for standards infrastructure RFAs
	• Study sections knowledgeable about standards
Social Science [and other fields like Epidemiology] is needed for good standards; the non-technical parts are just as important as the technical parts:
	• stakeholder engagement
	• project management
	• collaboration with global partners
	• is there a way to formalize this knowledge?
Reinforce a call for a forum where communities involved in standard creation can coordinate.
Life-cycle: you probably don't need many more NEW standards. Integration, revision, etc. should be considered. Each part of the life cycle needs technical and social support.


Chris


