Dapper in-situ conventions spec available

Joe Sirott Joe.Sirott at noaa.gov
Tue Oct 10 12:31:19 PDT 2006


Hi John, Ethan, Ted, etc:

Thanks for the extensive comments.

I'll respond to some of the specific comments in a later e-mail. But 
first some philosophy...

There is, of course, a trade-off between having a completely general 
data model that will permit any kind of data to be represented and a 
more rigid model that's designed to only represent one kind of data. In 
the first case, it's relatively easy to fit data into the model, but 
it's very difficult to write a client that will work reliably with the 
model. I think that's one reason why there are so few clients that can 
read OPeNDAP sequences -- sequences can model all kinds of in-situ data, 
but there are so many possibilities (n-level sequences, numerous data 
types) that no single client can handle all of them. Generalized models 
certainly make it easier for data providers (most of the people on this 
list) to serve data. But no one ends up using data from these servers -- 
it's too much work to write a client. In the latter case, it's easy to 
write a client but the rigidity of the model forces the data provider to 
use separate incompatible data models for different kinds of data and, 
once again, it's too much work to write a client that supports all the 
different simple, but incompatible, models.

I'm trying to strike a reasonable middle ground with the Dapper 
convention. It should be flexible enough to support many (but not all) 
kinds of in-situ data. It should be relatively easy to write a client 
for the data. General data types should be avoided if a specific data 
type will work just as well. For example, inner sequence variables are 
always Float32 because all of the in-situ measurements that I can think 
of can be represented as a single precision floating point number (with 
the exception of time, which needs to be a Float64). Why make it more 
complicated by allowing bytes, short ints, unsigned short ints, unsigned 
ints, strings, etc.? Similarly, attributes are /always/ strings and the 
outer sequence only contains ids and coordinate variables because it 
makes it easier for a client to determine which variables can be 
constrained. I think that John and Ted's idea of expanding the _id 
variable type to include strings fits in with this philosophy because it 
permits the use of natural keys. But I don't think that _id should be 
any type -- just Int32 or String (or maybe just String).
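
To make the restrictions above concrete, here is a rough Python sketch of 
the kind of check a client could run on a dataset's layout before trying 
to read it. The dictionary layout, the function name check_dapper_layout 
and the assumption that the time variable is called "time" are made up 
for illustration; none of them come from the Dapper spec or from any 
OPeNDAP library, and the optional attribute structures are ignored.

```python
# Hypothetical, simplified description of a two-level sequence:
# each dict maps a variable name to a DAP type name.
ALLOWED_ID_TYPES = {"Int32", "String"}
ALLOWED_AXIS_TYPES = {"Float32", "Float64"}


def check_dapper_layout(outer, inner, time_name="time"):
    """Return a list of problems; an empty list means the layout passes."""
    problems = []

    # The outer sequence needs an _id of a restricted type.
    id_type = outer.get("_id")
    if id_type is None:
        problems.append("outer sequence has no _id variable")
    elif id_type not in ALLOWED_ID_TYPES:
        problems.append(f"_id has type {id_type}, expected Int32 or String")

    # Everything else in the outer sequence is treated as a coordinate.
    for name, dap_type in outer.items():
        if name != "_id" and dap_type not in ALLOWED_AXIS_TYPES:
            problems.append(f"outer variable {name} has type {dap_type}")

    # Inner variables are Float32, except time, which is Float64.
    for name, dap_type in inner.items():
        expected = "Float64" if name == time_name else "Float32"
        if dap_type != expected:
            problems.append(
                f"inner variable {name} is {dap_type}, expected {expected}"
            )

    return problems
```

Because the set of allowed types is so small, the whole check fits in a 
few lines, which is the point of keeping the model narrow.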

The Dapper model works well with many kinds of station data (both 
time-series and profile data). It doesn't work well with trajectory 
data. And I'm not convinced that Ted's satellite data (which is similar 
to trajectories) would be a good fit, either. Both of these cases could 
probably be shoehorned into the Dapper model. Is that a good idea? 
Perhaps there should be a separate satellite swath data/trajectory 
convention for these.

- Joe

John Caron wrote:
> Hi All:
>
> Here are some specific comments on the spec:
>
> 1. "The inner sequence contains all of the measurement variables".
>     Generalization: It could be reasonable to allow measurements in the 
> outer sequence, to save space and clarify invariants.
>
> 2. "The outer sequence must have an Int32 variable named _id and this 
> variable must have a unique value for each entry of the outer sequence".
>     Generalization: _id could be any type, including String.
>
> 3. "The outer sequence must have two variables that specify the x 
> (longitude) axis and y (latitude) axis of the dataset. These variables 
> are identified by the axis attribute and have values of "X" and "Y" 
> respectively."
>     Clarity: ... have values of "lat" and "lon" respectively.
>
> 4. "The x,y, t, and z axes must have a Float32 or Float64 type."
>    Generalization: t could alternately be an ISO 8601 formatted String.
>
> 5. "The sequence may have an attributes variable (of type Structure) 
> that contains per-profile or per-time series metadata as a set of 
> name-value pairs... All members of the structure must have a String 
> type."
>    - Are these handled in any special way? Is there a need to separate 
> the measurements from the metadata? Otherwise, these could just be 
> variables in the inner sequence.
>    - Can there be other data in the outer sequence, e.g. a station id 
> string? Is this attribute structure a way to eliminate that?
>
> 6. "The sequence may have a variable_attributes variable of type 
> Structure that contains per-profile or per-time series metadata for a 
> specific data variable". The example, though, seems to describe 
> variable metadata as a whole, not per-profile. In any case, I wonder 
> if these can't just be variables in the inner sequence?
>
> 7.  Why do the inner sequence variables have to be floats?
>
> 8. How does the structure variable constrained_ranges differ from the 
> global attributes lon_range, lat_range, etc?
>
> 9. max_profiles_per_request could be unlimited?
>
> 10. total_profiles_in_dataset could be unknown?
>
> 11. "A server must support all selection constrains on coordinate 
> variables."
>     - Does this mean any combination of constraints, even very long 
> and complex ones? A useful restriction might be just space/time 
> bounding boxes.
>     - on all coordinate variables? a useful restriction might be only 
> on outer coordinate variables.
>
> 12. The DAS example implies that _id, lat, lon, time, depth can have 
> missing values. It seems like one might want to disallow that?
>   
> A. Stepping back, this spec allows:
>  - sets of variable length structures (i.e. 2D ragged arrays)
>  - constraint selections allow queries that subset the dataset by 
> space and time.
>  - the _id allows a smart client to first get the ids of the inner 
> sequences that have been subsetted by space and time, and then 
> retrieve those inner sequences by id.
>
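
The two-step access pattern in point A can be sketched as a pair of 
constraint-expression builders. This is only an illustration: the base 
URL, the sequence names and the variable names lon, lat, depth and temp 
are placeholders (the spec identifies the axes by their axis attribute, 
not by name), and a real client would probably need to URL-encode the 
operators before sending the request.

```python
def ids_query(base_url, seq, lon_range, lat_range):
    """Step 1: project only _id, select on the outer coordinates."""
    (lon_min, lon_max), (lat_min, lat_max) = lon_range, lat_range
    selection = [
        f"{seq}.lon>={lon_min}", f"{seq}.lon<={lon_max}",
        f"{seq}.lat>={lat_min}", f"{seq}.lat<={lat_max}",
    ]
    return f"{base_url}.dods?{seq}._id&" + "&".join(selection)


def profile_query(base_url, seq, inner, variables, profile_id):
    """Step 2: fetch the inner sequence of one profile, picked by _id."""
    projection = ",".join(f"{seq}.{inner}.{v}" for v in variables)
    return f"{base_url}.dods?{projection}&{seq}._id={profile_id}"


# Placeholder dataset and names, for illustration only.
BASE = "http://example.org/dapper/demo.cdp"
first = ids_query(BASE, "location", (-160, -150), (10, 20))
second = profile_query(BASE, "location", "profile", ["depth", "temp"], 42)
```

The first request is cheap because it touches only the outer sequence; 
the second is issued once per _id that the first request returned.
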
> B. One type of data we try to deal with is track/trajectory data where 
> x, y, z, and t can all vary from observation to observation (e.g., 
> data taken during an aircraft flight). So something like:
>
> netcdf trajectory_one.nc {
> dimensions:
>   traj = 5;
>   time = 2848;   // (has coord.var)
> variables:
>   String traj(traj);
>   double time(time, traj);
>   double depth(time, traj);
>   double latitude(time, traj);
>   double longitude(time, traj);
>
>   double temperature(time, traj);
> }
>
> A Dapper-ish view might look like:
>
> Dataset {
>    Sequence {
>        Int32 _id;
>        Sequence {
>            Float64 time;
>            Float64 lat;
>            Float64 lon;
>            Float64 elev;
>
>            Float32 temp;
>        } observation;
>    } trajectory;
>    ...
> } my_trajectory_collection;
>
> In this case, the x,y,z, and t coordinates are all in the inner 
> sequence. One could imagine other datasets that want some coordinates 
> in the inner and some in the outer sequence.
>
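
The mapping from the rectangular (time, traj) layout to the nested view 
in point B can be sketched in a few lines of Python. The function name 
to_nested and the in-memory layout (plain lists indexed [time][traj]) 
are assumptions made for this example; no netCDF or OPeNDAP library is 
involved, and missing values are not dropped, so every trajectory keeps 
the same length here.

```python
def to_nested(traj_ids, columns):
    """columns maps a variable name to a list of rows, indexed [time][traj]."""
    names = list(columns)
    n_time = len(columns[names[0]]) if names else 0
    nested = []
    for j, traj_id in enumerate(traj_ids):
        # One record per time step, holding every variable for trajectory j.
        observations = [
            {name: columns[name][i][j] for name in names}
            for i in range(n_time)
        ]
        nested.append({"_id": traj_id, "observation": observations})
    return nested
```

Each outer record carries only the _id, and all four coordinates travel 
with the measurements in the inner records.
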
> C. Providing all possible selections on a sequence is a real 
> implementation problem. One could state it as: how do you let the 
> client know which selections are efficient? It might be interesting to 
> provide some metadata conventions that do that in the general case, e.g.:
>
> NC_GLOBAL {
>   String selection_efficient "x y _id inner.depth";
>   String selection_forbidden "inner.*";
> }
>
> says "selections on x, y, _id, and inner.depth are efficient, anything 
> else on inner sequence is forbidden, anything else is allowed but may 
> be slow". Of course, clients would have to get smart to use that 
> info. This is equivalent to examining a database schema. In that 
> sense, the info might belong in the DDS somehow. OTOH, the DAP-2 spec 
> says that an opendap server must allow projections, so does that make 
> a dapper server non-compliant?
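
A client could interpret the proposed selection_efficient and 
selection_forbidden attributes along these lines. This is a sketch of 
one possible reading of the rule in point C (efficient names win, then 
forbidden patterns, then everything else is allowed but slow); the 
function name classify_selection is invented here, and treating the 
patterns as shell-style wildcards via fnmatch is an assumption.

```python
from fnmatch import fnmatchcase


def classify_selection(variable, efficient, forbidden):
    """Return 'efficient', 'forbidden' or 'slow' for one variable name."""
    # Names listed explicitly as efficient take precedence.
    if variable in efficient.split():
        return "efficient"
    # Otherwise any matching forbidden pattern rules the selection out.
    if any(fnmatchcase(variable, pattern) for pattern in forbidden.split()):
        return "forbidden"
    # Everything else is allowed, but the server may be slow.
    return "slow"


EFFICIENT = "x y _id inner.depth"
FORBIDDEN = "inner.*"
depth_status = classify_selection("inner.depth", EFFICIENT, FORBIDDEN)
temp_status = classify_selection("inner.temp", EFFICIENT, FORBIDDEN)
time_status = classify_selection("time", EFFICIENT, FORBIDDEN)
```
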
>
> D. BTW, we've been working on a netCDF-3 file convention for storing 
> observation data, and we are studying whether we can map these files into 
> this spec. A lot of the problems have to do with efficiently 
> implementing the projections. See
>
>  http://www.unidata.ucar.edu/software/netcdf-java/formats/UnidataObsConvention.html
>
> if you are interested.
>
> John and Ethan



More information about the Opendap-tech mailing list