File Formats 1a(ii) Minutes
Present in the session were Andrew Caruana, Thomas Saerbeck, Oleg Konocalo, Francesco Carla, Bridget Murphy, Becky Welbourn, Wei Bu, Wojciech Potrzebowski, Mrinal Bera and Robert Dalgliesh.

The session was tasked with discussing and answering a couple of very specific questions, which largely turned out to have fairly generic answers.

Who are file formats for?

Mainly for the end users. The expectation is that the facilities provide a data set that non-expert users can use immediately in fitting software and for plotting. Expert users will require a more detailed format that enables more advanced post-experiment data manipulation.

In addition, file formats are increasingly a means of sharing data with the “public” as a result of open data policies. These file formats should be self-describing and provide links to the original data through DOIs.

The feeling within the group was that the initial “simple” data file format that answers the majority of end users' needs (probably ASCII based) should be developed first. The more complete self-describing file is certainly a requirement but can come along later.

What are the requirements of different funding agencies?

In Germany it is not yet clear exactly what the funding agencies require. The facilities have taken the approach that they need to take the lead on this and have developed a combined bottom-up and top-down approach. This has meant looking at the requirements of the journals, users and beamline scientists in order to develop multiple files that will enable FAIR data management principles to be followed.

In the US, as far as the attendees from APS knew, no clear data policy is coming down from the funding agencies. The labs are formulating their own. This will need to be confirmed with representatives from NIST, ORNL etc.
In the UK, UK Research and Innovation (UKRI) supports the Concordat on Open Research Data (https://www.ukri.org/files/legacy/documents/concordatonopenresearchdata-pdf/), which has ten overarching principles, of which 6, 7 and 8 are probably the most relevant here:

6) Good data management is fundamental to all stages of the research process and should be established at the outset.
7) Data curation is vital to make data useful for others and for long-term preservation of data.
8) Data supporting publications should be accessible by the publication date and should be in a citeable form.

In addition there is a statement that open access to research outputs and data should be undertaken wherever possible, but that this decision should be based on cost among other criteria. We cannot find anything specific at the moment that goes any further than this.
Further information will need to be gathered from other EU countries.

What are the national data policies?

Data retention and the protection of users' rights to their data seem to be handled fairly consistently at the facilities in Europe.

ISIS operates a policy by which data is made public after 3 years but can be made open access earlier if requested by the PI of an experiment.
ESRF data is protected for 3 years, which can be extended to 5 years upon request by the PI.
ILL data is protected for 3 years, which is automatically extended to 5 years unless the PI says otherwise, after which the data becomes open access.
ESS is expected to adopt the PANOSC data policy (https://www.panosc.eu/about-panosc/).

In the US there is currently no concrete policy, even at the APS, but this is being looked at. Other facilities will have to be looked into.

The type of data that is archived and access controlled is also interesting to note. Petra stores raw and reduced data in the archive areas, as does ILL. ISIS archives only raw data and any automatically reduced data; it does not perpetually archive any other reduced data generated by the instruments. There is a push towards automated data reduction, which will generate more reduced data, but this may often not be the “final” reduced data set and may only be used for diagnostic purposes.

Other Issues and further reporting

Mrinal Bera will feed back on data policy in the US.

Bridget, Thomas, Wojciech and Francesco will report back with further details on either facility or national data policies. Thomas has already done this:

The ILL data policy is summarized on this web page, https://www.ill.eu/en/users/user-guide/after-your-experiment/data-management/, and the full legal text is attached. The attached file sets out all the distinctions between raw data, metadata and processed data.
In short: since October 2012 a non-disclosure period of three years has applied to all data (raw, meta and processed), during which data access is restricted to the experimental team. This period is extended to 5 years if no access request is made. After 5 years anyone can access the data without request, as far as I understand.
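
The embargo rules reported above for ISIS, ESRF and ILL can be written down as a small lookup, which may help when comparing facilities. The sketch below is only an illustration of those rules as recorded in these minutes, not any facility's actual implementation. The names `EmbargoPolicy`, `POLICIES`, `add_years` and `release_date` are hypothetical, the embargo is assumed to start on the last day of the experiment, and the ISIS option for the PI to release data early is not modelled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class EmbargoPolicy:
    """Hypothetical summary of a facility's embargo rules, in whole years."""
    base_years: int                       # default protected period
    extended_years: Optional[int] = None  # longer period, if one exists
    extension_is_default: bool = False    # True: extended unless the PI declines


# Values transcribed from the minutes; treat them as illustrative only.
POLICIES = {
    "ISIS": EmbargoPolicy(base_years=3),
    "ESRF": EmbargoPolicy(base_years=3, extended_years=5),
    "ILL": EmbargoPolicy(base_years=3, extended_years=5, extension_is_default=True),
}


def add_years(start: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def release_date(facility: str, experiment_end: date,
                 pi_choice: Optional[bool] = None) -> date:
    """Date on which the data would become open access.

    pi_choice: True if the PI asks for the extension, False if the PI
    declines it, None if the PI expresses no preference.
    Assumption: the embargo starts at experiment_end.
    """
    policy = POLICIES[facility]
    extended = policy.extension_is_default if pi_choice is None else pi_choice
    years = policy.base_years
    if extended and policy.extended_years is not None:
        years = policy.extended_years
    return add_years(experiment_end, years)
```

For example, `release_date("ILL", date(2020, 3, 1))` applies the 5-year default, while passing `pi_choice=False` falls back to the 3-year period. Keeping the rules in a table rather than in code branches makes it straightforward to add the US facilities or ESS once their policies are confirmed.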
Commercial Data

The point was also made that industrial users who pay for access directly will have a different set of requirements and restrictions placed on their data. Any file format needs to take account of this possibility and be flexible enough to allow for other differences in data collection, retention policies or permissions.

Analysed/fitted data format

The question of an analysed data format similar to CIF files was raised briefly but was not discussed in depth, because of the feeling that the analysis and interpretation of reflection data has many subtleties that are often best discussed in a paper. This may change given the potential need to provide a framework/file to describe reproducible data analysis (see the reproducibility sessions).

Digital Logbooks

The issue of digital logbooks and how they are stored was briefly discussed. Much of the information required is already gathered and saved along with the raw data at many facilities, but not all. The question of whether this data would be best stored in a separate file from the reduced data was mentioned but will need more discussion.
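
To make the earlier points about a “simple” ASCII file, DOI links to the original data and differing access restrictions more concrete, the following is a minimal sketch of what such a file could look like. It is not a format agreed by the group. The `# key: value` header convention, the header keys `raw_data_doi` and `access`, the column names and the function names are all assumptions made for illustration, and the DOI shown in the usage note is a placeholder.

```python
import io


def write_simple_file(stream, columns, rows, metadata):
    """Write '# key: value' header lines, a '# columns:' line, then data rows."""
    for key, value in metadata.items():
        stream.write(f"# {key}: {value}\n")
    stream.write("# columns: " + " ".join(columns) + "\n")
    for row in rows:
        stream.write(" ".join(f"{value:.6e}" for value in row) + "\n")


def read_simple_file(stream):
    """Read a file written by write_simple_file back into plain Python objects."""
    metadata, columns, rows = {}, [], []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Split on the first colon only, so values may contain colons.
            key, _, value = line[1:].partition(":")
            key, value = key.strip(), value.strip()
            if key == "columns":
                columns = value.split()
            else:
                metadata[key] = value
        else:
            rows.append([float(value) for value in line.split()])
    return metadata, columns, rows


def example_round_trip():
    """Write a tiny hypothetical data set to memory and read it back."""
    metadata = {
        "raw_data_doi": "10.0000/placeholder",  # placeholder, not a real DOI
        "access": "restricted",                 # hypothetical permission flag
    }
    columns = ["Q", "R", "dR"]                  # hypothetical column names
    rows = [[0.01, 0.98, 0.02], [0.02, 0.5, 0.01]]
    buffer = io.StringIO()
    write_simple_file(buffer, columns, rows, metadata)
    buffer.seek(0)
    return read_simple_file(buffer)
```

Because the header is free-form key/value text, a commercial or embargoed data set could carry a different `access` value without changing the data section, and a non-expert user can still load the numeric columns with any plotting or fitting tool that skips `#` comment lines.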