Grants Council; The Swedish Council for Research in the Humanities and Social Science, and The
Swedish Foundation for International Cooperation in Research.
Data Management Processes
Data Acquisition: The Research Team in each participating country is responsible for contributing and collecting a full set of data for that country. Project partners are at different stages of completion of this process, with some yet to commence. It is estimated that the cost per country for creating its own dataset is currently in the vicinity of AUS$300,000. This investment provides each country with access to the data of all participating countries in the project.
Current ingestion and distribution of data between project partners is via CD, DVD or portable pocket drives (80GB). The goal of the current project is to change this to a distributed international network for ingest, access, storage and analysis.
IP/Copyright of Data and scholarly output: The IP of each country’s dataset remains with the data
creators, the contributing research team, and it is the responsibility of the project leader in each country to
authorise access to the IP associated with the raw data, including approval for analysis proposals for that
data. It is assumed that researchers within the collaboration using the data of another country duly
acknowledge the authorship and origin of that data in any scholarly output generated from that data.
Data Quantities: It is estimated that each country’s raw dataset is 50GB. This will build up to around one
terabyte of raw data for the current 18 collaboration partners. This does not include the data associated
with any analyses or annotations of that data or any other scholarly outputs.
Data storage and Backup: The ICCR storage capacity is 11TB and adequately meets current project needs. In preparation for a projected increase in storage needs, an additional 14TB of storage space, to be held within the Faculty, has been requested for backup of ICCR data.
Once acquired, data are compressed or converted into the current format standards (currently MPEG-4, text and PDF) and stored on the centre server. An offsite backup of this version is made every time a new batch of data arrives and is maintained at the ICT Building, 111 Barry St. Local project output at Melbourne, namely analyses of data and scholarly outputs, is backed up nightly and stored both locally and offsite at the ICT Building, 111 Barry St.
The archiving of raw data is essentially the storage of original data as acquired from project partners – as DV tape, DVD, CD and, to a lesser degree, analogue VHS and audiotapes; all are held onsite in filing cabinets. There is no cataloguing of this data. It is assumed that originating countries also maintain copies of their own raw data. Raw classroom datasets are re-distributed to other sites in the same digital format as acquired. There is no updating of file formats or standards for this data; it remains as acquired.
The project team believes that this collection is one of local, national and international significance which should be preserved for re-use and re-purposing by future researchers. Current raw data storage formats suggest that the collection is not sustainable long term. There is a need to investigate format migration to protect the collection from digital format obsolescence, particularly where commercial formats have been used.
Data formats: A combination of open source and commercial formats is used by the project. The centre provides partners with documented guidelines for preferred data formats (a simple screening sketch follows the list below):
Video data – preferred format is MPEG-4, but also submitted as .mov, .mp4 and MPEG-1.
Text data – preferred format is .txt or PDF, but also submitted as .txt, .doc, TIFF and JPEG.
Image data – preferred format is PDF, but also submitted as TIFF and JPEG.
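To illustrate, the preferred-format guidelines above could be applied as a simple screening step when a batch arrives. The following is a minimal sketch in Python; the folder name, extension lists and handling are assumptions for illustration, not the Centre's actual Technical Guidelines.

from pathlib import Path
from typing import Optional

# Accepted extensions per data type, as listed in the guidelines above.
ACCEPTED = {
    "video": {".mp4", ".mov", ".mpg", ".mpeg"},            # preferred: MPEG-4
    "text": {".txt", ".pdf", ".doc"},                      # preferred: .txt or PDF
    "image": {".pdf", ".tif", ".tiff", ".jpg", ".jpeg"},   # preferred: PDF
}

def classify(path: Path) -> Optional[str]:
    """Return the data type a file is acceptable as, or None if unrecognised."""
    ext = path.suffix.lower()
    for data_type, extensions in ACCEPTED.items():
        if ext in extensions:
            return data_type
    return None

def screen_batch(folder: str) -> None:
    """Flag files in an incoming batch that fall outside the format guidelines."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and classify(path) is None:
            print(f"Flag for conversion or follow-up: {path.name}")

screen_batch("incoming_batch")  # hypothetical folder of newly arrived files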
The dataset for each country includes all materials associated with the delivery of a collection of thirty classroom lessons (ten lessons for each of three teachers), including video recordings of these 30 lessons from three camera perspectives, post-lesson interviews, transcripts, translations, general information about the school and other contextual information.
Please contact the Technical Manager for further information about the Technical Guidelines for LPS Data Processing
Video data analysis in Melbourne (and for some of the project partners) is conducted using the commercial software StudioCode, developed in conjunction with the Australian software company Sportstec. The project is currently negotiating with Sportstec to make this software "Grid-enabled", or distributed for access by all partners, which would enhance opportunities for collaborative analyses of project data. Analyses produced using StudioCode are not accessible without the software, which limits some types of research collaboration across project sites.
Long term use of this commercial software will require the storage of the software itself. This process has already been necessary for the preservation of another analysis tool used by the project, VPrism: discontinued software that was previously used by the Melbourne team and remains in use for some data and by some project partners. This software has been archived along with the Mac OS 9 operating system needed to run it.
Metadata: There is no systematic cataloguing of the data collection currently held within the project. Much of the information exists within a complex file classification system for the raw datasets, but it is not currently extracted and managed within a relational database and is not easily discoverable. Key project information is held in the heads of key personnel, in emails, or within other project papers.
The project plans to develop a metadata schema through which information about rights and permissions, provenance, technical, administrative/management and bibliographic/descriptive attributes of the data will be extracted and/or developed. It is anticipated that research teams will take responsibility for the development, maintenance and management of this metadata. Advice about metadata schemas for data management and collection preservation will be sought at that time, should this be deemed desirable by the collaboration.
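As a rough illustration of what such a schema might capture, the sketch below (Python) groups rights/permissions, provenance, technical, administrative and descriptive attributes into a single record. All field names and example values are hypothetical; they are not the schema the collaboration will adopt.

from dataclasses import dataclass, field, asdict
import json

@dataclass
class LessonRecord:
    # descriptive / bibliographic
    country: str
    teacher_id: str
    lesson_number: int
    # rights and permissions
    ip_holder: str                # contributing research team
    access_approved_by: str       # project leader who authorises access
    # provenance
    acquired_from: str            # original medium, e.g. "DV tape"
    acquisition_date: str
    # technical
    file_format: str              # e.g. "MPEG-4"
    file_size_mb: float
    # administrative / management
    storage_location: str = "ICCR central server"
    notes: list = field(default_factory=list)

record = LessonRecord(
    country="Sweden", teacher_id="T2", lesson_number=7,
    ip_holder="Swedish research team", access_approved_by="Project leader (Sweden)",
    acquired_from="DV tape", acquisition_date="2005-03-14",
    file_format="MPEG-4", file_size_mb=512.0,
)
print(json.dumps(asdict(record), indent=2))   # could later be loaded into a database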
Data Access, Authentication, Authorisation and Security: The current protocol is that data are stored centrally in Melbourne for all project sites. Arrangements for access to data by researchers within the project collaboration are manually managed, requiring the Melbourne facility to act as intermediary for data access requests. A more distributed model of data access and storage may streamline some of these processes.
A condition of access to a country’s data is that the requesting researcher must gain approval for his/her
proposed analysis from the originating country (the owners of the data IP). This acts to safeguard against
duplication of the same research outputs from different sites and ensures that first option for particular
analyses goes to the data creators.
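The approval condition above can be expressed as a simple rule: no analysis of another country's data proceeds until the originating country has recorded its approval. A minimal sketch in Python follows, with hypothetical researcher and proposal identifiers; it is not the project's actual request-handling process.

approvals = {
    # (requesting_researcher, dataset_country, proposal_id) -> approved?
    ("j.smith", "Japan", "P-012"): True,
    ("j.smith", "Germany", "P-013"): False,
}

def may_access(researcher: str, dataset_country: str, proposal_id: str) -> bool:
    """Grant access only when the originating country has approved the proposal."""
    return approvals.get((researcher, dataset_country, proposal_id), False)

print(may_access("j.smith", "Japan", "P-012"))    # True: approved analysis
print(may_access("j.smith", "Germany", "P-013"))  # False: awaiting approval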
PHASE TWO CONSULTATION
This project engaged in further consultation to identify and implement change around its data management processes. The focus of this follow-up was particularly on middleware selection and testing. This activity has commenced and will continue as part of the services provided by Information Services personnel at Research Computing and Information Management.
The exploration of various data infrastructure (middleware) models to address distributed data sharing and analysis across the global collaboration includes:
Establishment of an SRB testbed and testing of its suitability for project needs.
Sakai – participation in a Melbourne collaboration to explore whether this platform can meet project needs. Initial investigations suggest that Sakai may provide a possible solution for the second phase of the project: improving the interface and providing collaborative tools. (More information available at: http://www.sakaiproject.org/)
Notes:
The project is currently using version 2.0.5 of StudioCode. More information about the software is available at:
This is outlined in the Technical Guidelines for LPS Data Processing for the raw data compilation.
SRB – The SDSC Storage Resource Broker (University of California, San Diego) – supports shared collections that can be distributed across multiple organizations and heterogeneous storage systems. The SRB can be used as a Data Grid Management System (DGMS) that provides a hierarchical logical namespace to manage the organization of data (usually files); a conceptual sketch of such a namespace follows below. More information is available at: http://www.sdsc.edu/srb/index.php/Main_Page
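To make the logical-namespace idea concrete, the sketch below (Python) shows a toy catalogue that maps logical collection paths to physical replicas held at different partner sites. It illustrates the data-grid concept only and is not the SRB API; all paths and site names are invented.

catalogue = {
    "/lps/japan/lesson01/camera1.mp4": [
        "melbourne-store:/data/japan/lesson01/camera1.mp4",
        "tokyo-store:/archive/lesson01/camera1.mp4",
    ],
    "/lps/sweden/lesson03/transcript.txt": [
        "melbourne-store:/data/sweden/lesson03/transcript.txt",
    ],
}

def resolve(logical_path: str) -> str:
    """Return one physical replica for a logical path, wherever it is stored."""
    replicas = catalogue.get(logical_path)
    if not replicas:
        raise FileNotFoundError(logical_path)
    return replicas[0]  # a real broker would pick the closest or available replica

def list_collection(prefix: str) -> list:
    """List logical names under a collection, independent of physical location."""
    return sorted(p for p in catalogue if p.startswith(prefix))

print(list_collection("/lps/japan/"))
print(resolve("/lps/japan/lesson01/camera1.mp4"))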
The diagram below illustrates the proposed amended architecture for the collaborative infrastructure, incorporating the SRB layer. This architecture is still evolving.
[Diagram: proposed collaborative infrastructure incorporating the SRB layer. © 2006 Mitchell ICCR]
Discussions about metadata schema development to meet the increasing need for archiving of local project and collaboration assets. Current practice is not sustainable. Consultation with the collaboration is needed to consider the development of an appropriate schema for the group.
Discussions about data archiving, preservation metadata and consideration for the long term preservation
implications of current practice have commenced. This is not a current priority of the project but will need
to be raised with the collaboration.
Neuroscience MRI Computational Facility
The focus of this audit was the computational facility and repository for neuroimaging data (established
March 2006) and the sharing of computing and information resources among neuroscience researchers.
Two key members of the project team were consulted:
Associate Professor Gary Egan, Project Leader
Dr Neil Killeen, Facility Manager
Project Partners/Collaborators include:
Howard Florey Institute.
University of Melbourne (Electrical & Electronic Engineering, Computer Science & Software Engineering).
National ICT Australia (NICTA).
University of Queensland (Biomedical Engineering).
Flinders University (Psychology/Cognitive Neuroscience).
Silicon Graphics (SGI).
UNSW (Prince of Wales Medical Research Institute).
Data Management Processes
The core objective of this facility is to establish a distributed neuroscience facility. The current model for
the facility consists of a primary node and two secondary nodes: University of Melbourne (primary node);
University of Queensland (secondary node), and Flinders University (secondary node). This structure may
change over time; nodes may expand and more nodes may also be added. Currently, it is expected that
each node will provide both computational and some data storage capabilities, however, the primary node
will be substantially more capable. The current phase of the facility set up is focusing on the functionality
of the primary node at Melbourne (the team with whom we have consulted) and the establishment of the
Data Acquisition: Data are current research and legacy MRI data acquired from a number of scanners.
IP/Copyright of Data and scholarly output: Human MRI scans are de-identified subject (research participant) data. Generally, the IP of data belongs to the contributing researcher, though facility policy around this is yet to be documented. (A number of these issues have been raised in the facility issues paper Asset Privacy Requirements and Implications, Killeen, May 2006; policies are yet to be formalised.) Data reuse policy is also under discussion and consequently only non-human data are re-purposed within current guidelines.
Data Quantities:
Currently: 3 TB of existing data are being transferred into the facility.
Projected: An upper limit of approximately 30 TB within 2 years is possible based upon expected data acquisition and potential collaborations. This includes one-time legacy transfers. The current expected rate of data growth is <10 TB per annum.
Data storage and Backup:
Current storage capacity is 62 TB (effectively 33 TB, as data are stored redundantly on tape).
Data back up and storage processes are:
Data are migrated transparently between disk and tape via the Data Migration Facility (DMF) software. From the user's perspective, data are always 'online'.
Data are migrated redundantly (two copies) to tape by DMF, but not all files (e.g. small files) migrate to tape; a conceptual sketch of this kind of policy follows this list.
Data that are mistakenly deleted can be retrieved for up to 7 days.
There is currently no other copy of the data managed by the Facility. Data originating at hospitals
will usually have a copy there (but not easily accessible to users).
The Facility plans to make offsite copies of high-value assets from the data storage facility.
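As a conceptual illustration of the policy described in the list above (and not the actual DMF interface), the sketch below applies a size threshold before migrating files to two tape copies and honours a seven-day undelete window. The threshold value is an assumption.

from datetime import datetime, timedelta

SMALL_FILE_BYTES = 1_000_000          # assumed threshold; the real policy may differ
UNDELETE_WINDOW = timedelta(days=7)   # deletions recoverable for up to 7 days

def plan_copies(size_bytes: int) -> list:
    """Decide where copies of a file should live under this policy."""
    if size_bytes < SMALL_FILE_BYTES:
        return ["disk"]                                    # small files stay on disk
    return ["disk-cache", "tape-copy-1", "tape-copy-2"]    # redundant tape copies

def recoverable(deleted_at: datetime, now: datetime) -> bool:
    """Mistakenly deleted data can be retrieved within the undelete window."""
    return now - deleted_at <= UNDELETE_WINDOW

print(plan_copies(200_000))           # ['disk']
print(plan_copies(5_000_000_000))     # ['disk-cache', 'tape-copy-1', 'tape-copy-2']
print(recoverable(datetime(2006, 7, 1), datetime(2006, 7, 5)))   # True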
The facility needs to establish a reliable off-site store for back up storage and has begun exploring the
possibilities. The University Data Centre is a resource that might meet this requirement. However, the new
University Queensberry St Facility will not be available until second quarter 2007; its capacity may be
insufficient and the business model is not yet clear.
Data formats: Data formats and software are both open standard and commercial. The image formats DICOM, NIFTI and ANALYZE are de facto standards.
The Archive is built on top of a commercial Digital Asset Management system, MediaFlux. Data are pushed to and pulled from the Archive via MediaFlux. A DICOM server has been built into MediaFlux, providing tight integration between remote MRI scanners and the Archive. Thereafter data are retrieved from the Archive, with optional format conversion (e.g. DICOM to NIFTI), and various neuroscience analysis software packages are then used to analyze the data. The metadata may be retrieved from MediaFlux into interoperable XML files, and the data format itself is not changed by MediaFlux.
Notes:
DICOM is the MRI clinical standard. At times it does not meet the needs of the researcher.
NIFTI is a new data format standard for analyzing neuroscience imaging data. Processed data uploaded into the Archive will be in NIFTI format.
The facility manager has worked closely with the software provider, which has custom built the system for the MRI facility. More information about this software is available at: http://www.arcitecta.com/mediaflux/technology.html
The facility is developing its own custom interfaces to MediaFlux that will manage the user's workflow (ingest, egest, processing pipelines, etc.).
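For illustration, a retrieval-time DICOM to NIFTI conversion of the kind described above might look like the following sketch. It assumes the pydicom and nibabel Python packages and a folder of single-series DICOM slices; it is not the facility's MediaFlux pipeline, and the slice ordering and affine handling are simplified.

from pathlib import Path
import numpy as np
import pydicom
import nibabel as nib

def dicom_series_to_nifti(dicom_dir: str, out_path: str) -> None:
    """Stack DICOM slices into a volume and write it out as NIfTI-1."""
    slices = [pydicom.dcmread(str(p)) for p in sorted(Path(dicom_dir).glob("*.dcm"))]
    slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))   # order slices by z position
    volume = np.stack([s.pixel_array for s in slices], axis=-1).astype(np.int16)
    nib.save(nib.Nifti1Image(volume, affine=np.eye(4)), out_path)  # identity affine for simplicity

dicom_series_to_nifti("scanner_export/series_001", "subject001.nii.gz")  # hypothetical paths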
Metadata: Currently, the project is only holding technical metadata harvested from the DICOM metadata. It is anticipated that this will broaden in the future. The metadata schema used by the facility is loosely based on the BIRN schema (BIRN, the Biomedical Informatics Research Network, is a US-based network developing standards and tools for neuroinformatics; more information available at: http://www.nbirn.net/index.htm) and can be mapped into other schemas. The metadata are stored in a PostgreSQL database managed by MediaFlux. In the future, MediaFlux will have a self-contained database.
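A minimal sketch of harvesting technical metadata from a DICOM header into an interoperable XML record is shown below, assuming the pydicom package. The selected tags and element names are illustrative only and are not the facility's BIRN-based schema.

import xml.etree.ElementTree as ET
import pydicom

FIELDS = ["Modality", "StudyDate", "SeriesDescription",
          "MagneticFieldStrength", "RepetitionTime", "EchoTime"]

def harvest(dicom_path: str) -> str:
    """Return an XML fragment of selected DICOM header fields."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)   # header only, no pixel data
    root = ET.Element("mri_scan")
    for name in FIELDS:
        value = getattr(ds, name, None)    # a field may be absent from a given header
        if value is not None:
            ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

print(harvest("scanner_export/series_001/slice_0001.dcm"))   # hypothetical path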
Associate Professor Gary Egan states that the Australian neuroscience research community is currently working on developing an ontology for its data, as are other such communities in the US and Europe. This process requires foresight into how future researchers will want to describe their data and needs a strong conceptual framework; it needs to describe what the data are and how they might relate to other data that are potentially of interest in multi-disciplinary analyses. Participation in global collaborations like the INCF enables such work. (The Global Science Forum of the OECD initiated the International Neuroinformatics Coordinating Facility, INCF, to further the development of neuroinformatics as a global effort with the support of all ministers of research within the OECD; more information available at: http://incf.org/)
Data Access, Authentication, Authorisation and Security:
Currently data are only available to users within the project collaboration. Access to the collection is via
password authentication and does not have an open IP address.
Data from the Royal Children’s Hospital scanner are transferred to the Facility through a public IP with a
controlled Access list. In the future the facility will move this to a VPN for greater security. Access
between Howard Florey Institute and the Facility is currently via a VPN.
Astrophysics - Australian Virtual Observatory
The focus of this audit was the Australian Virtual Observatory (Aus-VO) and the Australian Astronomy
Grid (Aus-VO APAC Grid) which is being developed to handle the data storage and access needs for the
research community. This project is building a distributed high-bandwidth network of data servers (refer to Webster, slide 8: http://astro.ph.unimelb.edu.au/~rwebster/MU_mar05.ppt). The project is located within the Astrophysics Research Group in the School of Physics.
Three researchers were consulted during the audit:
Professor Rachel Webster, Lead Investigator
Dr Katherine Manson, Grid Research Programmer
Dr Randall Wayth
The Aus-VO partners with the IVOA (International Virtual Observatory Alliance)
University of Melbourne
Swinburne University of Technology
University of Sydney
University of New South Wales
University of Queensland
Australian National University and Mount Stromlo Observatory
Australia Telescope National Facility
Australian and Victorian Partnerships for Advanced Computing (APAC & VPAC)
Funding sources include:
Grants, including NSF and APAC.
Data Management Processes
This research community requires a highly specialized technical skill set.
The collaboration involves many
astronomical surveys to be conducted between 2003 and 2008. Each survey involves dataset sizes ranging
from 10-100 terabytes and 10 to 100 researchers. “It is no longer feasible to function effectively in
Astrophysics as a solo researcher.” (Webster)
Data Acquisition: The primary creator of data is the observatory. In the majority of cases observatory
data are virtually inaccessible to the outside world. To get data you actually have to go to the telescope
with your tape and download it (Webster). Each observatory has its own system for archiving data with
some not archiving their data at all. The Aus-VO collaboration is currently working on establishing a distributed model to increase data accessibility among partners. Once data are acquired by the research community, the primary responsibility for maintaining the data lies with the researcher or project group.
Aus-VO data, sourced from observatories, are pre-processed prior to ingest. This includes quality screening (e.g. "Do we want to keep this?"). There is also a reduction process applied to the raw data. Some of this may be done by the instrument, but most is done by the researchers. Researcher confidence in automated reduction processes is generally low, with most preferring to process/calibrate the raw data manually themselves (or to have their trusted PhD students and postdocs do so).
IP/Copyright of Data and scholarly output: This currently sits with the Observatory and the Chief
Investigator (CI). Observation time is booked with the observatory by the research project team. Data
collected over these times (which may run for a number of days/weeks) remains the IP of that project (CI)
for 12-18 months, depending on the observatory policy. After that time the data must be made publicly
available. In all cases the observatory would be acknowledged as the data source in scholarly output.
Data Quantities: The current store of around 40TB of Aus-VO data is quite scattered. All Melbourne data are stored and managed by APAC and held at the ANU Peta-Store. Swinburne has about 12 TB of data on tape that is around 10 years old, and the sustainability of this collection is unclear.
Projected data store requirements are set to increase dramatically from 2010, when the new Australian telescope will be generating 1-2.5 TB of data per day. The Melbourne team has bid to manage the data from this telescope.
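To put the quoted rate in context, a back-of-envelope projection (Python) from the 1-2.5 TB per day figure above gives roughly 0.4-0.9 PB per year; this is simple arithmetic on the numbers quoted, not official capacity planning.

for tb_per_day in (1.0, 2.5):
    per_year = tb_per_day * 365   # daily rate scaled to a full year
    print(f"{tb_per_day} TB/day -> {per_year:,.0f} TB/year (~{per_year / 1000:.1f} PB/year)")
# 1.0 TB/day -> 365 TB/year (~0.4 PB/year)
# 2.5 TB/day -> 912 TB/year (~0.9 PB/year)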
Data storage and Backup: As mentioned above, Melbourne’s data archiving is currently maintained by
APAC. Back up of this data is managed as per APAC protocols with the Peta-Store which is a fully
managed storage facility. Individual researchers at Melbourne also maintain their own personal research
data and scholarly output on their own PCs which are backed up by Faculty servers.
Long term storage for this research community is still under discussion. The IVOA has established an interest group on Data Curation and Preservation which is working towards identifying both mechanisms for the long-term preservation of astrophysics collections and sustainability procedures to ensure continued access to astrophysics collections that are at risk. Making decisions about what should and should not be preserved remains debatable, and parameters will need to be set by the community.
More information about the IVOA Data Curation and Preservation interest group available at:
Professor Webster believes that the Australian collaboration, as part of the global community, would be well served by a federated national facility. Sites within the federation could be institutionally based, well-managed data centres or other trusted sites.
Data formats: All data formats and software used are open standard or locally developed. In some cases,
telescopes may develop some of their own unique file formats. The IVOA facilitates collaboration on
standards to ensure the interoperability
of the different Virtual Observatory projects.
FITS and its various extensions are the open (NASA) standard for this research community. This standard has remained very stable and is used for archiving and transporting astronomical data. It has also been backward compatible for around 20 years. Raw and processed data are accessed and stored using FITS.
Data occur in different versions:
Observational data that comes from the observatories
Raw data – as it comes out of the telescope in raw state
Processed data – raw data that has had some cleaning by the instruments
Science ready data – Observational data that has been processed/calibrated by researchers. In some cases this can be an automated process. In general, however, astronomers prefer to implement these calibrations on their own observations (or have their graduate students do so) rather than trusting any automated process.
Simulations – data generated by researchers using some sort of modeling software/process
Data analyses of any or all of the above versions are generated by researchers using various software/code for analysis, e.g. Miriad, IRAF.
Metadata: The amount of metadata is variable. FITS captures some metadata. There is no current schema standard being used, though this may change with IVOA data curation initiatives. For ground-based telescopes, astronomers' log books document important data such as the ambient temperature, the moisture and whether it is raining. This would need to be entered manually during data archiving but tends not to happen.
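As an illustration of how such log-book observations could travel with the data, the sketch below (assuming the astropy Python package) reads a FITS header and writes observing-condition keywords into it. The keyword names, values and file path are assumptions, not an adopted convention.

from astropy.io import fits

with fits.open("observation.fits", mode="update") as hdul:    # hypothetical file
    header = hdul[0].header
    print(header.get("TELESCOP"), header.get("DATE-OBS"))     # existing header metadata
    # record conditions normally noted only in the astronomer's log book
    header["AMBTEMP"] = (12.5, "Ambient temperature [deg C]")
    header["HUMIDITY"] = (0.63, "Relative humidity")
    hdul.flush()                                               # write changes back to the file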
Data Access, Authentication, Authorisation and Security:
The data have restricted access for 12-18 months (depending on the observatory) and then become openly available. Access is managed by the observatory and varies accordingly. The APAC Grid project is looking at a distributed model for transferring and accessing data. Security arrangements associated with data transfer are still being developed.
Hydrological Measurement and Monitoring
The audit was focused on the Hydrological Monitoring Network projects of the Hydrology Research
Group, Department of Civil and Environmental Engineering. This data collection has national and
international significance. Soil moisture data has been collected from parts of regional Victoria and New
South Wales since 2001. It is currently the only Australian collection of soil moisture data that is presented via an accessible website, containing background information about monitoring sites, instruments and drilling, right through to the actual data. The goal, subject to funding, is to preserve the collection and continue to collect data long term.
More information about interoperability of standards: http://www.ivoa.net/twiki/bin/view/IVOA/WhoIsWho
Project websites with data presentation are located at: http://www.civenv.unimelb.edu.au/~jwalker/data/oznet/