Dapper in-situ conventions spec available

James Gallagher jhrg at mac.com
Tue Oct 10 22:31:05 PDT 2006


On Oct 10, 2006, at 1:31 PM, Joe Sirott wrote:

> Hi John, Ethan, Ted, etc:
>
> Thanks for the extensive comments.
>
> I'll respond to some of the specific comments in a later e-mail.  
> But first some philosophy...
>
> There is, of course, a trade-off between having a completely  
> general data model that will permit any kind of data to be  
> represented and a more rigid model that's designed to only  
> represent one kind of data. In the first case, it's relatively easy  
> to fit data into the model, but it's very difficult to write a  
> client that will work reliably with the model. I think that's one  
> reason why there are so few clients that can read OPeNDAP sequences  
> --sequences can model all kinds of in-situ data, but there are so  
> many possibilities (n-level sequences, numerous data types) that no  
> single client can handle all of them. Generalized models certainly
> make it easier for data providers (most of the people on this list)  
> to serve data. But no one ends up using data from these servers --  
> it's too much work to write a client. In the latter case, it's easy  
> to write a client but the rigidity of the model forces the data  
> provider to use separate incompatible data models for different  
> kinds of data and, once again, it's too much work to write a client  
> that supports all the different simple, but incompatible, models.

I agree.

>
> I'm trying to strike a reasonable middle ground with the Dapper  
> convention. It should be flexible enough to support many (but not  
> all) kinds of in-situ data. It should be relatively easy to write a  
> client for the data. General data types should be avoided if a  
> specific data type will work just as well. For example, inner  
> sequence variables are always Float32 because all of the in-situ  
> measurements that I can think of can be represented as a single  
> precision floating point number (with the exception of time, which  
> needs to be a Float64). Why make it more complicated by allowing  
> bytes, short ints, unsigned short ints, unsigned ints, strings,  
> etc? Similarly, attributes are always strings and the outer  
> sequence only contains ids and coordinate variables because it  
> makes it easier for a client to determine which variables can be  
> constrained. I think that John and Ted's idea of expanding the _id  
> variable type to include strings fits in with this philosophy  
> because it permits the use of natural keys. But I don't think that  
> _id should be any type -- just Int32 or String (or maybe just String).

String and Float (32/64) have such a useful range that they are very  
good choices. JGOFS (the prototypical system on which Sequence is  
based) used only strings in its default configuration, although it  
could also be convinced to use float.

James
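
As a rough illustration of the type restrictions discussed above, a client-side check might look like the following Python sketch. It is not part of the Dapper spec or of any OPeNDAP library: the function name, the dict-based description of the sequences, and the choice to accept both Int32 and String for _id (which is only a proposal in this thread) are all assumptions.

```python
# Hypothetical sketch: each sequence is described as a dict mapping
# variable name -> DAP type name. Nothing here comes from a real library.

# Assumption: _id may be Int32 (current spec text) or String (proposed in this thread).
ALLOWED_ID_TYPES = {"Int32", "String"}


def check_dapper_types(outer, inner, time_name="time"):
    """Return a list of problems found; an empty list means the types look OK."""
    problems = []

    id_type = outer.get("_id")
    if id_type is None:
        problems.append("outer sequence has no _id variable")
    elif id_type not in ALLOWED_ID_TYPES:
        problems.append(f"_id has type {id_type}, expected Int32 or String")

    # Inner sequence variables are Float32, except time, which is Float64.
    for name, dap_type in inner.items():
        expected = "Float64" if name == time_name else "Float32"
        if dap_type != expected:
            problems.append(
                f"inner variable {name} has type {dap_type}, expected {expected}"
            )
    return problems
```

With only two inner types to consider, a client can allocate its buffers without inspecting each variable's type in detail, which is the simplification being argued for.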
>
> The Dapper model works well with many kinds of station data (both  
> time-series and profile data). It doesn't work well with trajectory  
> data. And I'm not convinced that Ted's satellite data (which is  
> similar to trajectories) would be a good fit, either. Both of these  
> cases could probably be shoehorned into the Dapper model. Is that a  
> good idea? Perhaps there should be a separate satellite swath data/ 
> trajectory convention for these.
>
> - Joe
>
> John Caron wrote:
>> Hi All:
>>
>> Here are some specific comments on the spec:
>>
>> 1. "The inner sequence contains all of the measurement variables".
>>     Generalization: It could be reasonable to allow measurements in
>> the outer sequence, to save space and clarify invariants.
>>
>> 2. "The outer sequence must have an Int32 variable named _id and  
>> this variable must have a unique value for each entry of the outer  
>> sequence".
>>     Generalization: _id could be any type, including String.
>>
>> 3. "The outer sequence must have two variables that specify the x
>> (longitude) axis and y (latitude) axis of the dataset. These
>> variables are identified by the axis attribute and have values of
>> "X" and "Y" respectively."
>>     Clarity: ... have values of "lat" and "lon" respectively.
>>
>> 4. "The x,y, t, and z axes must have a Float32 or Float64 type."
>>    Generalization: t could alternately be an ISO 8601 formatted  
>> String.
>>
>> 5. "The sequence may have an attributes variable (of type  
>> Structure) that contains per-profile or per-time series metadata  
>> as a set of name-value pairs... All members of the structure must  
>> have a String type."
>>    - Are these handled in any special way? Is there a need to
>> separate the measurements from the metadata? Otherwise, these
>> could just be variables in the inner sequence.
>>    - Can there be other data in the outer sequence, e.g. a station id
>> string? Is this attribute structure a way to eliminate that?
>>
>> 6. "The sequence may have a variable_attributes variable of type  
>> Structure that contains per-profile or per-time series metadata  
>> for a specific data variable". The example, though, seems to  
>> describe variable metadata as a whole, not per-profile. In any
>> case, I wonder if these can't just be variables in the inner sequence?
>>
>> 7.  Why do the inner sequence variables have to be floats?
>>
>> 8. How does the structure variable constrained_ranges differ from  
>> the global attributes lon_range, lat_range, etc?
>>
>> 9. max_profiles_per_request could be unlimited?
>>
>> 10. total_profiles_in_dataset could be unknown ?
>>
>> 11. "A server must support all selection constrains on coordinate  
>> variables."
>>     - Does this mean any combination of constraints, even very  
>> long and complex ones? A useful restriction might be just space/ 
>> time bounding boxes.
>>     - on all coordinate variables? a useful restriction might be  
>> only on outer coordinate variables.
>>
>> 12. The DAS example implies that _id, lat, lon, time, depth can  
>> have missing values. It seems like one might want to disallow that?
>>
>> A. Stepping back, this spec allows:
>>  - sets of variable length structures (i.e. 2D ragged arrays)
>>  - constraint selections allow queries that subset the dataset by
>> space and time.
>>  - the _id allows a smart client to first get the ids of the inner  
>> sequences that have been subsetted by space and time, and then  
>> retrieve those inner sequences by id.
>>
>> B. One type of data we try to deal with is track/trajectory data  
>> where x, y, z, and t can all vary from observation to observation  
>> (e.g., data taken during an aircraft flight). So something like:
>>
>> netcdf trajectory_one.nc {
>> dimensions:
>>   traj = 5;
>>   time = 2848;   // (has coord.var)
>>   variables:
>>   String traj(traj);
>>   double time(time, traj);
>>   double depth(time, traj);
>>   double latitude(time, traj);
>>   double longitude(time, traj);
>>
>>   double temperature(time, traj);
>> }
>>
>> A Dapper-ish view might look like:
>>
>> Dataset {
>>    Sequence {
>>        Int32 _id;
>>        Sequence {
>>            Float64 time;
>>            Float64 lat;
>>            Float64 lon;
>>            Float64 elev;
>>
>>            Float32 temp;
>>        } observation;
>>    } trajectory;
>>    ...
>> } my_trajectory_collection;
>>
>> In this case, the x,y,z, and t coordinates are all in the inner  
>> sequence. One could imagine other datasets that want some  
>> coordinates in the inner and some in the outer sequence.
>>
>> C. Providing all possible selections on a sequence is a real  
>> implementation problem. One could state it as: how do you let the  
>> client know what selections are efficient. It might be interesting  
>> to provide some metadata conventions that do that in the general  
>> case, eg:
>>
>> NC_GLOBAL {
>>   String selection_efficient "x y _id inner.depth";
>>   String selection_forbidden "inner.*";
>> }
>>
>> says "selections on x, y, _id, and inner.depth are efficient,  
>> anything else on inner sequence is forbidden, anything else is
>> allowed but may be slow". Of course, clients would have to get
>> smart to use that info. This is equivalent to examining a
>> database schema. In that sense, the info might belong in the DDS  
>> somehow. OTOH, the DAP-2 spec says that an opendap server must  
>> allow projections, so does that make a dapper server non-compliant?
>>
>> D. BTW, we've been working on a netCDF-3 file convention for
>> storing observation data, and we are studying if we can map these  
>> files into this spec. A lot of the problems have to do with  
>> efficiently implementing the projections. See
>>
>>  http://www.unidata.ucar.edu/software/netcdf-java/formats/UnidataObsConvention.html
>>
>> if you are interested.
>>
>> John and Ethan
>
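
To make the trajectory example in point B of the quoted message more concrete, here is a minimal Python sketch of how rectangular (time, traj) arrays could be regrouped into one outer entry per trajectory, each holding an inner list of observations. The function name and the plain-list representation are assumptions for illustration; it does not read netCDF files, does not handle missing values, and assigns _id from the trajectory index.

```python
# Hypothetical sketch: regroup columns indexed [time][traj] into a nested,
# sequence-of-sequences layout. Plain Python lists stand in for real arrays.

def to_nested_sequences(traj_names, columns):
    """traj_names: list of trajectory names.
    columns: dict of variable name -> 2D list indexed [time][traj].
    Returns one dict per trajectory with an 'observation' list."""
    names = list(columns)
    n_time = len(columns[names[0]]) if names else 0

    trajectories = []
    for j, traj_name in enumerate(traj_names):
        # One inner record per time step, holding every variable for trajectory j.
        observations = [
            {name: columns[name][t][j] for name in names}
            for t in range(n_time)
        ]
        trajectories.append(
            {"_id": j, "name": traj_name, "observation": observations}
        )
    return trajectories
```

Because time, lat, lon and depth all end up in the inner records, nothing in the outer entry can be used for a space/time constraint, which is the mismatch with the Dapper layout described in the quoted text.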

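For the selection metadata suggested in point C of the quoted message, a client might interpret the two attributes roughly as in the Python sketch below. The attribute names and example values come from the quoted text; the function name, the return labels, and the use of shell-style wildcard matching for "inner.*" are assumptions, since the message does not define the pattern syntax.

```python
from fnmatch import fnmatchcase


def classify_selection(variable, efficient, forbidden):
    """Classify a selection on `variable` as 'efficient', 'forbidden' or 'slow'.

    efficient: space-separated variable names (selection_efficient attribute).
    forbidden: space-separated wildcard patterns (selection_forbidden attribute).
    """
    # Explicitly listed variables win, even if a forbidden pattern also matches.
    if variable in efficient.split():
        return "efficient"
    # Assumption: patterns use shell-style wildcards, e.g. "inner.*".
    if any(fnmatchcase(variable, pattern) for pattern in forbidden.split()):
        return "forbidden"
    # Anything else is allowed but may be slow.
    return "slow"
```

With the values from the example, classify_selection("inner.temp", "x y _id inner.depth", "inner.*") falls into the forbidden group, while "inner.depth" is still treated as efficient because the explicit list is checked first.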
--
James Gallagher                jgallagher at opendap.org
OPeNDAP, Inc                   406.723.8663

More information about the Opendap-tech mailing list