Please comment: On the suitability of W3C Media Fragments for biodiversity multimedia

Within TDWG Audubon Core, we are considering which standard to use for labelling sub-regions of sound recordings, images, etc. For example, I can draw a rectangular box in an image or a spectrogram, and give it a species label. This happens a lot! How can we exchange these "boxes" between software and databases reliably?

The question is: should we use the W3C’s "Media Fragments" syntax? In particular, I’m looking at section 4.2, on selecting temporal and spatial sub-regions.

Temporal region examples:

    t=10,20   # => results in the time interval [10,20)
    t=,20     # => results in the time interval [0,20)
    t=10      # => results in the time interval [10,end)

Spatial region examples:

    xywh=160,120,320,240        # => results in a 320x240 box at x=160 and y=120
    xywh=pixel:160,120,320,240  # => results in a 320x240 box at x=160 and y=120
    xywh=percent:25,25,50,50    # => results in a 50%x50% box at x=25% and y=25%
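To make the examples above concrete, here is a minimal Python sketch of how a consumer might read these selectors. The function names (`parse_t`, `parse_xywh`, `parse_fragment`) are hypothetical, and the sketch assumes times are given in plain seconds and box values are whole numbers. It is not a complete Media Fragments parser.

```python
import math


def parse_t(value):
    """Parse the value of a t= selector into a half-open interval (start, end).

    Assumption: times are plain seconds, optionally prefixed with "npt:".
    A missing start means 0; a missing end means "until the end of the media".
    """
    if value.startswith("npt:"):
        value = value[len("npt:"):]
    start_text, _, end_text = value.partition(",")
    start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else math.inf
    return start, end


def parse_xywh(value):
    """Parse the value of an xywh= selector into (unit, x, y, w, h)."""
    unit = "pixel"  # pixels are the default when no unit is given
    if ":" in value:
        unit, value = value.split(":", 1)
    if unit not in ("pixel", "percent"):
        raise ValueError(f"unknown unit: {unit!r}")
    parts = value.split(",")
    if len(parts) != 4:
        raise ValueError("xywh needs exactly four values")
    x, y, w, h = (int(part) for part in parts)
    return unit, x, y, w, h


def parse_fragment(fragment):
    """Parse a fragment such as '#t=10,20&xywh=160,120,320,240' into a dict."""
    result = {}
    for pair in fragment.lstrip("#").split("&"):
        name, _, value = pair.partition("=")
        if name == "t":
            result["t"] = parse_t(value)
        elif name == "xywh":
            result["xywh"] = parse_xywh(value)
    return result
```

For instance, `parse_fragment("#t=10,20&xywh=percent:25,25,50,50")` yields a dict with a `(10.0, 20.0)` interval and a percent-based box.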

My perspective:

The definitions of the values themselves are good, and we should directly follow their example. (For time, the values are Normal Play Time (npt, RFC 2326), which can be given purely in seconds or as hh:mm:ss.*; other formats such as ISO 8601 datetime are available as "advanced" use. For space, values are in pixels or percentages, with pixels as the default, and with x=y=0 at the top-left of the image.)
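As a rough illustration of those value definitions, the sketch below converts an npt time string to seconds and a percentage box to pixels. Both function names are hypothetical. The sketch assumes only seconds or colon-separated clock times (no ISO 8601 datetimes), and it uses simple floor division for the percentage conversion, which may not match the rounding a given implementation applies.

```python
def npt_to_seconds(text):
    """Convert an npt time such as '10', '10.5' or '0:02:01.5' to seconds.

    Assumption: the input is either plain seconds or a clock time with
    at most three colon-separated parts (hh:mm:ss with optional fraction).
    """
    parts = text.split(":")
    if len(parts) > 3:
        raise ValueError(f"not an npt time: {text!r}")
    seconds = 0.0
    for part in parts:
        # Each step to the right is worth 1/60 of the unit before it.
        seconds = seconds * 60 + float(part)
    return seconds


def percent_box_to_pixels(box, image_width, image_height):
    """Convert an (x, y, w, h) box in percent to pixels.

    The origin x=y=0 is the top-left corner of the image.
    """
    x, y, w, h = box
    return (
        x * image_width // 100,
        y * image_height // 100,
        w * image_width // 100,
        h * image_height // 100,
    )
```

On a 640x480 image, the percentage box 25,25,50,50 from the examples corresponds to the pixel box 160,120,320,240.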

The structure of the selectors, however, I think could lead to problems for annotating biodiversity multimedia:

Comma-separated formats for fields are likely to lead to errors when used in CSV data.
There are existing use-cases which refer to single points in time/space rather than regions. (This could however be handled as regions of zero extent: t=10,10 or xywh=160,120,0,0.)
The format "t=10" for the time interval [10,end) risks user error, since it could be read as, or used as, a single point in time. (In retrospect it would have been better to define the format as "t=10,".)
We wish to provide for a frequency axis, with similar region-selection characteristics as the temporal and spatial. (See freqLow and freqHigh recently added to Audubon Core.)
We would like to allow for 3D spatial extents (xyzwhd?).
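The CSV concern in the first point can be demonstrated with a short Python sketch. The file name and species label are made-up sample data. A CSV writer that quotes correctly copes with the embedded comma, but any consumer that simply splits each line on commas will see the wrong number of fields.

```python
import csv
import io

# Hypothetical annotation row: media file, Media Fragments selector, label.
row = ["recording42.wav", "t=10,20", "Erithacus rubecula"]

# A standards-aware writer quotes the field that contains a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue().strip()
# line is now: recording42.wav,"t=10,20",Erithacus rubecula

# A naive consumer splits on every comma and gets four fields, not three.
naive = line.split(",")

# A proper CSV reader recovers the original three fields.
proper = next(csv.reader([line]))
```

The data survives only if every tool in the chain handles quoting correctly, which is the practical risk with comma-separated selector values.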

So, as one possibility: we could adopt the W3C’s approach to defining the values, by explicitly referring across to their use of RFC 2326 etc. But instead of simply recommending Media Fragments, we would NOT recommend the t or xywh selectors, and would instead recommend separate fields for timeStart, timeEnd, freqLow, freqHigh, and so forth.
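A minimal sketch of what such a separate-fields record could look like in Python is given below. The class name `RegionOfInterest` and its methods are hypothetical, the units (seconds and Hz) are assumptions, and only the four field names come from the proposal above. The `to_media_fragment` method shows that a `t=` selector can still be derived when needed, while the frequency fields have no Media Fragments equivalent.

```python
from dataclasses import dataclass
from typing import Optional


def _fmt(value):
    """Format a number compactly, keeping at most millisecond precision."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


@dataclass
class RegionOfInterest:
    # Assumed units: seconds for time, Hz for frequency.
    timeStart: Optional[float] = None
    timeEnd: Optional[float] = None
    freqLow: Optional[float] = None
    freqHigh: Optional[float] = None

    def is_time_point(self):
        """A point in time is stored as a region of zero extent."""
        return self.timeStart is not None and self.timeStart == self.timeEnd

    def to_media_fragment(self):
        """Build a t= selector from the time fields, or None if both are unset."""
        if self.timeStart is None and self.timeEnd is None:
            return None
        start = "" if self.timeStart is None else _fmt(self.timeStart)
        if self.timeEnd is None:
            return f"t={start}"
        return f"t={start},{_fmt(self.timeEnd)}"
```

For example, `RegionOfInterest(timeStart=10, timeEnd=20, freqLow=2000, freqHigh=8000).to_media_fragment()` gives `"t=10,20"`. Each field also maps to its own CSV column, so no value contains a comma.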

I should say that my background is with audio data, and so for selecting image regions there may be existing good practice/recommendations that I haven't spotted.

My blog doesn't have a "comments" function, but I'd like to read your comments! You can reach me on Twitter or by email: dstowell (attt) tilburguniversity.edu

Thu 18 February 2021 | science | Permalink
