
Talk – Metadata Unstructured Guidelines
Matthew Lawler (lawlermj1@gmail.com)
11 Nov 2019


Metadata Unstructured Guidelines

• Scope + AS IS
• Requirements/Rules
• TO BE
What is metadata?
Metadata is data about data, or data that describes data.
It is the documentation that accompanies the research data and makes it discoverable and usable over time.
The scope here is unstructured data, which occurs mainly in documents. This is about 90% of all data.

Scope
Area | Topic | Detail
Goal | | The goal is to maximise openness and transparency, through Search and Publication support, while minimising risk, through using available Authorisation services.
Metadata Type | Classification | Classification type metadata will be included. See the table for sample metadata kinds and sources. The DCAT standard, which is endorsed by data.gov.au, will be used. See https://toolkit.data.gov.au/Discovering_Metadata.html
Purpose | Publication | The purpose of the metadata will be to support Publication. Classified unstructured data is sufficient for this purpose.
Purpose | Search | That is, how metadata can support, and reduce the cost of, the Document Information Retrieval process across all files.
Data Type | Unstructured Data | Unstructured data types are included, enough to support search and publication, with very limited integration using some Master data types.
Data Type | Spatial Data | Spatial data types are included, but only the aspects of Spatial data that support search and publication, with very limited integration.
Unit | File | The basic unit will be a file.
Approach | | Integrate with current tools using a federated approach. The CKAN platform will be used as the main storage and search tool.
Out of Scope
Area | Topic | Detail
Purpose | Guidance and Control | The purpose of the metadata will not be to support Guidance and Control. Fully defined structured data is needed for these purposes.
Metadata Type | Full Definition | Full Definition type metadata will be excluded. See the table for sample metadata kinds and sources.
Data Type | Structured Data | Structured data is quite different to unstructured data, so it will be excluded.
Document Type | Design | This will deal with high level design only. A technical design will be produced separately.
Platform | PDMS | The Parliamentary Document Management System (PDMS) is another external publication platform, but with inbuilt search and security capability. It will not be included in the scope of these Guidelines.

AS IS
Thor: ‘Where is that file?’
Korg: ‘Doug knows. Go ask Doug. … O wait – Doug’s dead’

1. Cannot do a global cross platform search;
2. Many drive files should be on Trim;
3. Platform Metadata mostly empty;
4. No Metadata publication support;
5. Manual Publishing to data.gov (37 files);
6. Cannot harvest other research.
AS IS – There is an open data gap
Criteria | TRIM | SharePoint | Spatial Portal
Platform purpose | Records management system | Internal Collaboration system | Geospatial data system
Internal/External | Internal | Internal | Both
Check in/out? | Yes | Yes | No?
Spatial Support? | No | No | Yes
File types Supported? | Most | Microsoft | Spatial
Publish to Agency Site? | Yes | No | No
Harvest Open Data? | No | No | Yes
Publication Support? | No | No | Open ANZLIC
Publication standard | None | None | None
Search Content | Full search over unlocked files. | Full search over a subset of files. | Some
Search Cross platform | None | Some | None

Search and Publishing Reqts – DCAT #1
Term: Description
Title: Title of dataset
Description: Description of the dataset
Keyword: Keywords, subjects, topics of dataset (select most frequent)
Theme: The government jurisdiction business function for the dataset.
Language: Default is English.
Licence: Licence type
Rights: Use the standard licence text to determine the licence type.
Data Status: Is the data updated (active, yes) or static/persistent (inactive, no)?
Update Frequency: How often the dataset is updated
Expose User Contact: Should the user contact details be exposed as well as the organisation details?
Landing page: URL with information on resource. Must be the UUID URL.
Publish date: Original publish date of record
Update date: Date modified
Identifier: Use a system generated UUID value.
Metadata URI: Automatically generated metadata URI.
Download URL: URL of the downloadable file for the resource.
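As a rough illustration of the Identifier and Landing page terms, the sketch below holds a subset of these fields in a Python dataclass, generates the UUID automatically, and derives the landing page from it. The class name, the field subset, and the base URL `https://data.example.org/dataset` are assumptions made for this example and are not part of the guidelines.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date

# ASSUMPTION: hypothetical catalogue base URL, not taken from the guidelines.
CATALOG_BASE_URL = "https://data.example.org/dataset"


@dataclass
class DatasetRecord:
    """A subset of the DCAT terms in the table, as plain Python fields."""
    title: str
    description: str
    keywords: list[str]
    theme: str
    licence: str
    language: str = "English"  # default stated in the table
    publish_date: date = field(default_factory=date.today)
    # Identifier is a system generated UUID value.
    identifier: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def landing_page(self) -> str:
        # The landing page must be the UUID URL, so it is derived from the identifier.
        return f"{CATALOG_BASE_URL}/{self.identifier}"


record = DatasetRecord(
    title="Sample report",
    description="Illustrative record only.",
    keywords=["metadata", "sample"],
    theme="Research",
    licence="CC-BY-3.0-AU",
)
print(record.identifier)
print(record.landing_page)
```

Deriving the landing page from the identifier, rather than storing it separately, keeps the two values from drifting apart.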

Search and Publishing Reqts – DCAT #2
Term: Description
File size: Automatically generated from the file (MB).
Access URL: Conditional: use Access URL when the resource is not a direct download.
Media type: Media type of distribution as defined by IANA
Format: File format of the distribution. If available in IANA, use Media Type.
Publisher: Name of the Entity/publishing organisation. Controlled via CKAN accounts.
Contact: Contact details of the publishing organisation, including full name, telephone, email.
Data Portal: Which data portal.
Jurisdiction: Which Australian Government Jurisdiction
Homepage: Entity/Publisher homepage
Publisher: Which person (userid) published this data?
Contact: Which person (vcard) published this data?
Temporal coverage from: Start of temporal series in dataset
Temporal coverage to: End of temporal series used in dataset
Geospatial coverage: Spatial description of resource, based on bounding box derived from placenames defined by gazetteer
ISO 19115 Topic: Main theme(s) of the dataset if spatial
Field(s) of Research: The Australian and New Zealand Standard Research Classification (ANZSRC).
Data Models: Free text option to add information about models, ontologies, taxonomies, etc.
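The File size, Media type and Format terms can be filled in automatically. The following is a minimal sketch using Python's standard `mimetypes` module, whose built-in table covers common IANA media types. The function name, the returned keys, and the choice of 1 MB = 1024 × 1024 bytes are assumptions for illustration.

```python
import mimetypes
import os
import tempfile


def describe_distribution(path: str) -> dict:
    """Return file size (MB), media type and format for one file."""
    media_type, _encoding = mimetypes.guess_type(path)
    # Fall back to the upper-cased file extension when no media type is known.
    extension = os.path.splitext(path)[1].lstrip(".").upper()
    return {
        # ASSUMPTION: 1 MB is taken as 1024 * 1024 bytes.
        "file_size_mb": round(os.path.getsize(path) / (1024 * 1024), 3),
        "media_type": media_type,
        "format": media_type or extension,
    }


# Demonstration on a throwaway 1 MB file with a .pdf extension.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
    handle.write(b"\0" * (1024 * 1024))
    sample_path = handle.name

info = describe_distribution(sample_path)
os.remove(sample_path)
print(info)
```

Note that `mimetypes.guess_type` looks only at the file name, not the content, so a mislabelled file would be described incorrectly.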

Searching Reqts – Agency Specific #1
Term: Description
Auth - Licence: Licence type detected
Auth - Publication State: CanBePublished | IsPublished | ReadyToPublish | Unpublishable
Auth - Security Class Final: Manually entered
Auth - Security Class Provisional: Unofficial | Unclassified Open | Unclassified DLM
DLM Duration: Default 10 years; shorter for market data
DLM Start/Expiry Date: (no description)
DLM Reason: Strings used to indicate DLM type
DLM Type: No DLM (Open) | DLM FOUO | DLM Sensitive | DLM Sensitive Legal | DLM Personal
ID - File Name: Name of file
ID - Paths: Prior paths for file location (preserved as the file is moved around)
ID - Trim Record Number: Populated if document is on Trim. E.g. E2017/0399 or D17/14967.
Lookup Org Units: A list of Org Units mentioned in the file.
Lookup Organisations: A list of external Organisations mentioned in the file.
Lookup People: A list of people mentioned in the file.
Lookup Spatial Coordinate List: List of coordinates mentioned in the file
Lookup Spatial Location List: List of placenames mentioned in the file
Terms - Act names: What Acts or parts of Acts are mentioned in the file?
Terms - Namespaces: Any special terms that belong to relevant Namespaces.
Trace Quote Reference: Any quote codes from Finance system.
Trace Work Order Code: Any work order codes from Finance system.
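Two of these terms lend themselves to simple automation: finding Trim record numbers in file text, and computing a DLM expiry date from the default 10 year duration. The sketch below is illustrative only. The regular expression is inferred solely from the two example numbers given above, so the real Trim numbering scheme may be wider, and both function names are hypothetical.

```python
import re
from datetime import date

# ASSUMPTION: pattern inferred only from the examples E2017/0399 and D17/14967;
# real Trim record numbers may use other prefixes or digit counts.
TRIM_RECORD_RE = re.compile(r"\b[DE]\d{2,4}/\d{4,5}\b")

DEFAULT_DLM_YEARS = 10  # "Default 10 years" in the table


def find_trim_record_numbers(text: str) -> list[str]:
    """Return every substring that looks like a Trim record number."""
    return TRIM_RECORD_RE.findall(text)


def dlm_expiry(start: date, years: int = DEFAULT_DLM_YEARS) -> date:
    """Return the DLM expiry date, `years` after the start date."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # 29 Feb has no counterpart in a non-leap year, so fall back to 28 Feb.
        return start.replace(year=start.year + years, day=28)


print(find_trim_record_numbers("See E2017/0399 and D17/14967 for details."))
print(dlm_expiry(date(2019, 11, 11)))
```

A shorter duration, such as for market data, can be passed through the `years` argument.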
Security and Publication Types
1. This shows the Security Class and Publication State Diagram.
2. Shows transitions, and implied rules.
3. Files classed as Unofficial or Official DLM are regarded as Unpublishable.
4. Unclassified documents are in 'Can be Published' as default.
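The default rules in points 3 and 4 can be sketched as a small lookup. The security class names follow the Publication Rules table. The forward-only progression CanBePublished → ReadyToPublish → IsPublished is an assumption, since the diagram's actual transitions are not available in this text.

```python
from enum import Enum


class PublicationState(Enum):
    CAN_BE_PUBLISHED = "CanBePublished"
    READY_TO_PUBLISH = "ReadyToPublish"
    IS_PUBLISHED = "IsPublished"
    UNPUBLISHABLE = "Unpublishable"


# Defaults stated in points 3 and 4.
DEFAULT_STATE = {
    "Unofficial": PublicationState.UNPUBLISHABLE,
    "Official DLM": PublicationState.UNPUBLISHABLE,
    "Official Unclassified": PublicationState.CAN_BE_PUBLISHED,
}

# ASSUMPTION: forward-only progression; the original diagram may allow other moves.
NEXT_STATE = {
    PublicationState.CAN_BE_PUBLISHED: PublicationState.READY_TO_PUBLISH,
    PublicationState.READY_TO_PUBLISH: PublicationState.IS_PUBLISHED,
}


def default_state(security_class: str) -> PublicationState:
    """Return the default publication state for a security class."""
    return DEFAULT_STATE[security_class]


def advance(state: PublicationState) -> PublicationState:
    """Move one step towards publication, or fail if no step exists."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} cannot be advanced")
    return NEXT_STATE[state]


state = default_state("Official Unclassified")
print(state.value, "->", advance(state).value)
```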

Publication Rules
This is a proposal for integrating Security Class and Licence Type.

O | Security Class | Publication | Metadata | Licence Type | Platform
1 | Unofficial | Unpublishable | Minimal | Any | Only Internal Platforms
2 | Official DLM | Unpublishable | Minimal | Any | Only Internal Platforms
3 | Official Unclassified | CanBePublished | Minimal | Any | Only Internal Platforms
4 | Official Unclassified | ReadyToPublish | Complete | Any | Record Mgt System (Eg TRIM)
5 | Official Unclassified | IsPublished | Complete | CC-BY-3.0-AU | Internet site and Record Mgt System
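One way to make the proposal checkable is to encode the five rows as a lookup keyed on security class and publication state. This is a sketch only; the `PublicationRule` tuple and the `rule_for` function are names chosen for this example.

```python
from typing import NamedTuple


class PublicationRule(NamedTuple):
    metadata: str
    licence: str
    platform: str


# The five rows of the proposal, keyed on (security class, publication state).
RULES = {
    ("Unofficial", "Unpublishable"):
        PublicationRule("Minimal", "Any", "Only Internal Platforms"),
    ("Official DLM", "Unpublishable"):
        PublicationRule("Minimal", "Any", "Only Internal Platforms"),
    ("Official Unclassified", "CanBePublished"):
        PublicationRule("Minimal", "Any", "Only Internal Platforms"),
    ("Official Unclassified", "ReadyToPublish"):
        PublicationRule("Complete", "Any", "Record Mgt System (Eg TRIM)"),
    ("Official Unclassified", "IsPublished"):
        PublicationRule("Complete", "CC-BY-3.0-AU",
                        "Internet site and Record Mgt System"),
}


def rule_for(security_class: str, publication_state: str) -> PublicationRule:
    """Return the rule for a combination, rejecting any not in the table."""
    try:
        return RULES[(security_class, publication_state)]
    except KeyError:
        raise ValueError(
            f"No rule for {security_class!r} in state {publication_state!r}"
        ) from None


print(rule_for("Official Unclassified", "IsPublished"))
```

Combinations absent from the table, such as an Unofficial file marked IsPublished, raise an error rather than silently passing.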

TO BE
1. A common standard to avoid a metadata war, incorporating DCAT + unique Agency metadata;
2. Stand up the CKAN Platform;
3. CKAN will use pointer addresses (FTP/HTTP/etc) back to the original file;
4. Files will remain on platforms such as Trim, SharePoint, File system etc.;
5. DCAT metadata files stored by CKAN;
6. Cross platform search now possible by CKAN;
7. Publication supported by CKAN;
8. Staff can harvest other open data by CKAN;
9. DCAT uses the W3C RDF format.
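Points 3 and 4 can be illustrated with a CKAN dataset whose resource holds only a URL pointing back to the file on its home platform. The sketch below builds such a payload using CKAN's standard dataset fields (`name`, `title`, `notes`, `owner_org`, `resources`), the shape accepted by the `package_create` action of CKAN's Action API. It does not contact a server, and the organisation name and SharePoint URL are hypothetical.

```python
import json


def build_ckan_package(name: str, title: str, notes: str, owner_org: str,
                       file_url: str, file_format: str) -> dict:
    """Build a CKAN dataset payload that points at a file stored elsewhere."""
    return {
        "name": name,            # URL-safe dataset slug
        "title": title,
        "notes": notes,          # CKAN's field for the description
        "owner_org": owner_org,
        "resources": [
            {
                "name": title,
                # Pointer address back to the original file; the file stays in place.
                "url": file_url,
                "format": file_format,
            }
        ],
    }


payload = build_ckan_package(
    name="sample-report",
    title="Sample report",
    notes="Illustrative record only.",
    owner_org="example-agency",  # hypothetical organisation
    file_url="https://sharepoint.example.org/sites/research/report.docx",  # hypothetical
    file_format="DOCX",
)
print(json.dumps(payload, indent=2))
```

Because only the pointer is stored, access to the file itself is still governed by the platform that holds it.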

TO BE
RDF
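To give a feel for DCAT expressed as RDF, the sketch below writes one dataset and one distribution in Turtle syntax using the `dcat:` and `dct:` namespaces. It is a hand-built string for illustration, not validated output; an RDF library such as rdflib would be a more robust choice. The function name and the example URLs are assumptions.

```python
DCAT_PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
)

IANA_MEDIA_TYPES = "https://www.iana.org/assignments/media-types/"


def _escape(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def dataset_to_turtle(uri: str, title: str, identifier: str,
                      download_url: str, media_type: str) -> str:
    """Serialise a minimal dcat:Dataset with one dcat:Distribution."""
    return (
        DCAT_PREFIXES + "\n"
        f"<{uri}> a dcat:Dataset ;\n"
        f'    dct:title "{_escape(title)}" ;\n'
        f'    dct:identifier "{_escape(identifier)}" ;\n'
        "    dcat:distribution [\n"
        "        a dcat:Distribution ;\n"
        f"        dcat:downloadURL <{download_url}> ;\n"
        f"        dcat:mediaType <{IANA_MEDIA_TYPES}{media_type}>\n"
        "    ] .\n"
    )


turtle = dataset_to_turtle(
    uri="https://data.example.org/dataset/1234",        # hypothetical
    title='Sample "report"',
    identifier="1234",
    download_url="https://data.example.org/files/report.pdf",  # hypothetical
    media_type="application/pdf",
)
print(turtle)
```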

Define
Term: Definition
CKAN: CKAN manages the storage and distribution of open data.
Data Catalog Vocabulary: DCAT is a vocabulary which facilitates interoperability between Web data catalogs.
DCAT: Data Catalog Vocabulary
Dissemination Limiting Marker: A DLM is a marker that means that a file cannot be published on an open web site, such as data.gov.au. DLM reasons include FOUO, Sensitive, PII, HR, etc.
Metadata: Metadata is often called ‘data about data’.
Metadata Wars: A common result of enforcing a single metadata platform. This will be avoided by using a federated approach, with metadata spread across multiple locations, but conforming to a common standard.
PDMS: Parliamentary Document Management System
Published: Any document released to the Agency external website.
Security Classification: A protective marking placed on files to help determine authorised use.
SharePoint: SharePoint is a collaboration and document management system.
Spatial Data: Any data or information that identifies the geographic location of features and boundaries.
Structured Data: Structured files will be those that enforce strong, well defined typing. Examples include MDB, XML, KML, etc.
TRIM: Total Records and Information Management
Unstructured Data: Unstructured files will be those that are weakly typed, and allow any data type to be inserted anywhere in the file type. Examples include DOC, XLS, RTF, etc.
