[FGED-discuss] Notes from the Community Based Standards workshop (Part 3)

Chris Stoeckert stoeckrt at upenn.edu
Wed Feb 25 23:45:41 UTC 2015

From the afternoon session.

Session Four: Complexity and Data Diversity

Philippe Rocca-Serra
ISA point of view (ISATools)
	• Std need: to enable comms, release, and preservation of data
	• Needed? disruptive tech w/ massive but uncontrolled data growth, I/O bottlenecks
	• Working well? Uptake by users!
	• Evaluate them? ease of use, support, documentation, implementation guides, flexibility, extensibility, etc.
	• What std will work best? Create and Curate registry of stds (biosharing), Create metrics and eval criteria, neutral assessment by review or standardization bodies (NIST?)

Michel Dumontier
	1. W3C HCLS dataset description.
Using RDF but did not have descriptions for it. People were regenerating the same data and did not know it. Developed guideline for describing datasets. “A two month effort easily becomes a two year thing when there is no dedicated staff in charge.”
Proposal: Dedicated staff needed.
	2. Bio2RDF, which is linked data for the life sciences.
Problem is that data providers are always changing formats.
Proposal: Add data interoperability into data sharing plans.

Charles Bailey 
	• Need for common terminologies.  Consensus on the standards: where do you need to invest in alignment, and where are the areas of diminishing returns?  Gave the example of 2 major studies that defined things slightly differently and ended up with two silos of data giving opposite answers to essentially the same question.
	• Repeatable data characterization - how much metadata needs to be defined to make them repeatable and well characterized?
	• Domain-specific requirements: how usable and useful is the standard, and how extensible is it, so that it can be used to facilitate interoperability?

Olivier Bodenreider
	• Do we need additional standards?  (No.)
	• Encourage reuse: make standards easier to discover and adopt, though reuse sometimes goes through mappings.  Both mapping and post-coordination are extremely hard.  (Putting terms together needs to obey some rules of syntax and semantics.)  Example of SNOMED, which is difficult to use but at least makes post-coordination possible.
	• Depending on how it is consumed, a standard may be easier to look at/browse/consume -- it is also possible to create services (e.g. APIs).  Example of RxNorm, a terminology integration system for drugs.  A dozen drug terminologies are integrated; it allows cross-walking and creates named relations across different types of drug entities.  However, RxNorm was found to be difficult to use, because users were looking at the database, not the logical structure.  Now a REST API makes it easier to use.  Point is that you don’t necessarily create something new, but rethink how to present and distribute what exists and make it easier for people to get at.
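
As a rough illustration of the RxNorm point (a service in front of the terminology instead of the raw database), here is a minimal Python sketch. It assumes the public RxNav endpoint `rxcui.json?name=...` and a response of the form `{"idGroup": {"rxnormId": [...]}}`; both should be checked against the current RxNorm API documentation. The function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public RxNav REST service; verify against current docs.
RXNAV_BASE = "https://rxnav.nlm.nih.gov/REST"


def build_rxcui_url(drug_name: str) -> str:
    """Build the lookup URL for a drug name (hypothetical helper)."""
    return f"{RXNAV_BASE}/rxcui.json?{urlencode({'name': drug_name})}"


def extract_rxcuis(payload: dict) -> list:
    """Pull RxNorm identifiers out of a response, assuming the idGroup/rxnormId layout."""
    return payload.get("idGroup", {}).get("rxnormId", [])


def lookup_rxcuis(drug_name: str) -> list:
    """Call the service over HTTP and return the identifiers it reports (needs network)."""
    with urlopen(build_rxcui_url(drug_name), timeout=10) as response:
        return extract_rxcuis(json.load(response))
```

A caller only needs `lookup_rxcuis("lipitor")` and never touches the underlying tables, which is the kind of repackaging the talk argued for.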

W3C provides technical support, access to teleconferencing, and a commitment to publication longevity.
This support is invaluable (e.g. a pool of such agents to help get the standard through the process).

Should standards be considering IP?
Standards should be freely available.
The mechanism to keep the door open when money is spent is funding bodies like NIH.
To what extent is this process science itself? It is not good for standards efforts to be relegated to spare time; they need full-time support.

Actionable suggestions that NIH could take:
	• Michel: a registry of the things that are there; for those that exist where there is poor management or unclear use cases, assigning project managers to help.
	• Think about standards in their own right rather than as part of a DMP.
	• If you want people to use data elements, you need to provide a repository and a demonstration that they can be useful.
	• Philippe: evaluation.  Include social scientists to study how people are using standards and why they don’t use them.  This will help build the better standard of the future.
	• Chris Chute: use-case-driven standards are a slippery slope -- secondary uses already come up.  We need standards that can represent the data, full stop.

Recurring Themes from Day One:
	• Funding or other incentives for recognition/publication needed, because volunteer efforts are limited.
	• If publication is the goal, can lead to fragmentation b/c people want to be novel/individual for publications.
	• Hardest part of developing standards is negotiating with people, social aspects are more difficult than technology development.
	• Incentives structure needs to be refined.
	• Technology proceeds faster than std development, so a dynamic approach is needed, but difficult to implement.
	• Not enough evaluation methods for stds.  How do we decide that they are working?  That they are being used?  That they need to be changed/updated/killed?  What metrics will we use?
	• Reduce/Reuse/Recycle: do not want to reinvent the wheel when creating/implementing/setting/revising stds.
	• If no one set of common data standards will work for all purposes, how do we even make standards?  Thinking of standards as boundary objects may help.
	• What should the life-cycle of standards be?
	• How do you balance community buy-in with a need to manage the std creation process?  Bottom-up vs Top Down approaches.
	• Building standards without due consideration of use-cases and end-users is a fruitless exercise. Use of standards (or not) by clinicians in clinical settings reinforces this idea.
	• When do we need standards?  Standards are useful, but labor-intensive and quickly become obsolete.  The question becomes: When is it worth it to develop standards and how can we make use of existing models/resources/etc.?

Session Five: Synthesis Breakouts

Group 1: Melissa, Weida Tong, Yaffa, Chris Chute, Chris Mungall, Lynn E.
Possible topics:
	• When in the research cycle do standards come into play?  (e.g. people start the research before submitting a proposal)  FOR TOMORROW
	• Educational component -- part of PhD program?  or remedial training?
	• Not everyone on NIH study sections has data standards expertise; how do you educate/encourage/etc. the study section members to pay attention to the data management plan?
		• Matrices of domain by standard by rigor?  Educate about what?  (There’s a BD2K program dedicated to that)
	• From the perspective of a reviewer, what do they need in order to score the proposal?
	• What do we do next?  
		• How do we feed back what we’ve learned in the education part of BD2K to this?
		• Most investigators don’t get the vast shift in biomedical research.  Entering into team science, and very data intensive uses in research.  
		• Burning issue of provenance and reproducibility -- accepted that it needs to go into the education landscape, and standards go along with that.
		• DDI creating some of the
		• Concierge - a person who helps connect you with the right approach for what you’re doing with your data.  And the people should be able to feed back to the NIH -- redundant, incompatible to fix -- standards navigators -- like a library function -- like a preliminary library search -- ‘standards informationist’ to help them develop a data management plan.  Even the librarians don’t know all the standards - they send you to …  Need subject area knowledge -- need the infrastructure to support the service -- DSR (data standards repository) --?  -- effective?  connected to data?
		• Connect the standard information to the repository and back the other way as well.
	• Granular collections of data (atomic data standards) -
	• We need to identify where translational research is blocked because of issues with data standards. (inadequate or conflicting data standards)
Takehome messages, recommendations:
Actionable items/ recommendation -
	• NIH study sections need to include review of data management plans by information science experts.  (And/or assistance to develop?)  A separate score. (Pre-review to stratify? Or short proposal review process?)  Need NIH to say up front what they’re looking for.  If they get funded, part of the funding goes to work with NIH data scientist to create the data management plan.  You don’t want to kill great science on a technicality.  What about costs associated with data management?  Can vary.
	• re: Educational portion of BD2K program -  provenance and reproducibility -- accepted that it needs to go into the education landscape, and standards and DDI go along with that.  Note that we need to teach the teachers.  More coalescence or refinements of best practices in  [Train pie people]
	• Identify where translational research is blocked because of issues with inadequate or conflicting data standards.

Group 2: Alexa McCray / Sheri Schully

Alexa McCray
Sheri Schully
Anita  de Waard
Olivier Bodenreider
Eric Neumann
Michel Dumontier
Tom Oniki
Hong Huang
Chris Stoeckert

Takehome messages, recommendations:

Identify gaps with greatest potential impact:
	• Timing: it takes way too long to accomplish standards.  A shorter life cycle is needed; need to increase the pace
	• Maybe we disrupt the entire model of how we develop standards
	• Lack of incentives to develop standards (need dedicated project staff for these purposes) (dedicated mechanisms for funding standards development)
	• Evaluation and metrics for Success need to be built in
	• Clear use cases (or competency questions that can be validated) for each standard development with people who are invested in the outcomes
	• Currently no way to apply for grants that address Data Science (metrics, infrastructure, evaluation)
	• We are living in the past with regard to software. Modern software capabilities require scientists to partner with industry (including Amazon, Facebook, banking, etc.). Cray is one example of a company that does this type of thing
	• Leadership is needed in standards development (funders, scientists, journal editors all together)
	• Gap in education in standards development (need an educational component)- train grantees and reviewers in data science (potential checklist for reviewers)

Common Pain Points:
	• Get working processes of software development in place with domain experts and software developers.
	• More specific data sharing plans with clear criteria (more formal- adhere to  standards). White paper on criteria that should be used. Proof that data were shared (like a Data Index).

Prioritized Needs:
	• Data Science 2.0 (include data scientists as a part of R01s)
	• Data Standards development needs to be overhauled to adopt industry standards and tool development. Can build from some of the early industry tools and adapt them for your needs. Can utilize SBIRs for this area….
	• Proof of concept data science pilot projects (ex: precision medicine proof of concept).
	• Need to have more merging of Industry and Academia in this space. Be clear about where IP begins and ends if we want industry to come to the table.

Identify Best Practices:
	• All about Openness:
        standards have to be open; careers can be made using these standards

Group 4:
Facilitator: Philippe Rocca-Serra

James Taylor, Sofia Heidi, James Overton, Eva Huala, Todd Carpenter, Michael Hogan, Gary Bader,  Lesley Skalla, Ryan Brinkman, Charles Bailey

Takehome messages, recommendations:
	• need to scope the effort (avoid slippery slope)
	• to avoid a steep increase in complexity (as assessed by parsing, for example)
	• what drives overcapture is underspecification
	• when is use reasonably anticipated?
	• opportunity cost: capturing information and then later converting to a standard form.
	• different kind of users of the data standards
		• creator/collector of data
		• consumer of data.
	• at least one use case for justifying inclusion of data element in a model
	• include the use case as part of the standard, as a unit test (with a freely available implementation; ideally 2 independent implementations)
	• include a discussion in the data sharing plan about which standard will be used
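
The two use-case recommendations above (at least one use case per data element, and use cases shipped as unit tests) can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Standard` and `UseCase` classes, the element names, and the checks are all hypothetical and do not come from any existing standard.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class UseCase:
    name: str
    elements: List[str]              # data elements this use case relies on
    check: Callable[[Record], bool]  # executable form of the use case


@dataclass
class Standard:
    elements: List[str]
    use_cases: List[UseCase] = field(default_factory=list)

    def unjustified_elements(self) -> List[str]:
        """Elements that no use case refers to (candidates for removal)."""
        used = {e for uc in self.use_cases for e in uc.elements}
        return [e for e in self.elements if e not in used]

    def failed_use_cases(self, record: Record) -> List[str]:
        """Names of use cases whose check does not pass on this record."""
        return [uc.name for uc in self.use_cases if not uc.check(record)]


# Hypothetical standard with one element that has no justifying use case.
standard = Standard(
    elements=["sample_id", "organism", "collection_date", "favorite_color"],
    use_cases=[
        UseCase("group samples by organism", ["sample_id", "organism"],
                lambda r: bool(r.get("sample_id")) and bool(r.get("organism"))),
        UseCase("order samples in time", ["sample_id", "collection_date"],
                lambda r: isinstance(r.get("collection_date"), str)
                and len(r["collection_date"]) == 10),
    ],
)

record = {"sample_id": "S1", "organism": "Homo sapiens",
          "collection_date": "2015-02-25"}
print(standard.unjustified_elements())    # ['favorite_color']
print(standard.failed_use_cases(record))  # []
```

Running the same checks against two independent implementations' output would be one way to act on the "ideally 2 independent implementations" suggestion.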

Closing the loops:
	• document why a standard is not being used -> feed this back to the relevant efforts
	• creating a credit record for sharing data in a standard format (link to bioCADDIE, DDICC, CEDAR)
	• measuring how data are being reused as evidence of efficiency of a standard

