Dapper in-situ conventions spec available

Steve Hankin Steven.C.Hankin at noaa.gov
Tue Oct 10 13:33:42 PDT 2006


Hi Joe,

I second you on the philosophy that you have articulated.   It seems 
clear that the first priority is to standardize current practice.  An 
OPeNDAP standard that handles collections of time series and profiles is 
the pressing target.

I would add that the nature of the discussion is not 
take-it-or-leave-it.  The standard can be constructed as an evolving 
thing.  At this stage it does not seem wise to attempt to 
standardize types of data that have not been well tested -- e.g. 
satellite swaths.  Shouldn't those needs be handled through testbeds 
first?  Standardization should happen after the lessons of the testbed 
become clear.

There is an obvious impulse, however, to make sure that the standard is 
crafted in a way that does not rule out future extensions -- e.g. 
handling swaths in the relatively near future.  One way to handle this 
might be to include appendices that describe possible encodings of these 
data without making them officially part of the standard at this stage.  
That will guide the testbeds, making a clear target to be evaluated.

    - Steve

P.S. Regarding the specific question of whether it is wise to limit the 
datatype to Float32, only, for the inner sequence -- both sides of the 
argument have merit. A first order requirement of this standard is to 
hide (via a web service (lowercase) definition) the difference between 
data maintained as a collection of netCDF files and data in (say) a 
relational database.  I believe that there are already examples of 
databases where profile data are stored as int (or even short), with a 
scale and offset provided in order to recover the floating point (fixed 
decimal point) representation.  So there may, indeed, be a requirement 
(though weak) for greater flexibility in binary encodings.

This type of "packed"/"compressed" encoding can be borrowed from the 
netCDF style guides (and the CF standard).  The stronger argument, to 
me, is that we should borrow as much as reasonable from the CF 
encodings.  Again -- maybe the best solution is a compromise: version 1 
of DAPPER should be restricted to the Float32 encoding, with a paragraph 
in an appendix that discusses broadening to other datatypes in version 2.
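
To make the "packed" idea concrete, here is a minimal sketch of how a 
client might recover Float32 values from integers stored with a scale 
and offset.  The attribute names scale_factor and add_offset follow the 
CF convention; the function names, the sample values, and the fill 
value of -32768 are assumptions made up for illustration, not anything 
taken from the DAPPER spec.

```python
import math
import struct


def to_float32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def unpack(packed, scale_factor, add_offset, fill_value=None):
    """Apply CF-style unpacking: value = packed * scale_factor + add_offset.

    Entries equal to fill_value (if given) are returned as NaN.
    """
    out = []
    for p in packed:
        if fill_value is not None and p == fill_value:
            out.append(math.nan)
        else:
            out.append(to_float32(p * scale_factor + add_offset))
    return out


# Hypothetical profile: temperatures stored as 16-bit ints in
# hundredths of a degree, with -32768 marking a missing sample.
packed_temp = [1234, 1198, -32768, 1050]
temp = unpack(packed_temp, scale_factor=0.01, add_offset=0.0,
              fill_value=-32768)
```

If the server did this conversion itself, the client would only ever 
see Float32, which is one way to read the version 1 restriction.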

=================================================

Joe Sirott wrote:
> Hi John, Ethan, Ted, etc:
>
> Thanks for the extensive comments.
>
> I'll respond to some of the specific comments in a later e-mail. But 
> first some philosophy...
>
> There is, of course, a trade-off between having a completely general 
> data model that will permit any kind of data to be represented and a 
> more rigid model that's designed to only represent one kind of data. 
> In the first case, it's relatively easy to fit data into the model, 
> but it's very difficult to write a client that will work reliably with 
> the model. I think that's one reason why there are so few clients that 
> can read OPeNDAP sequences -- sequences can model all kinds of in-situ 
> data, but there are so many possibilities (n-level sequences, numerous 
> data types) that no single client can handle all of them. Generalized 
> models certainly make it easier for data providers (most of the people 
> on this list) to serve data. But no one ends up using data from these 
> servers -- it's too much work to write a client. In the latter case, 
> it's easy to write a client but the rigidity of the model forces the 
> data provider to use separate incompatible data models for different 
> kinds of data and, once again, it's too much work to write a client 
> that supports all the different simple, but incompatible, models.
>
> I'm trying to strike a reasonable middle ground with the Dapper 
> convention. It should be flexible enough to support many (but not all) 
> kinds of in-situ data. It should be relatively easy to write a client 
> for the data. General data types should be avoided if a specific data 
> type will work just as well. For example, inner sequence variables are 
> always Float32 because all of the in-situ measurements that I can 
> think of can be represented as a single precision floating point 
> number (with the exception of time, which needs to be a Float64). Why 
> make it more complicated by allowing bytes, short ints, unsigned short 
> ints, unsigned ints, strings, etc? Similarly, attributes are /always/ 
> strings and the outer sequence only contains ids and coordinate 
> variables because it makes it easier for a client to determine which 
> variables can be constrained. I think that John and Ted's idea of 
> expanding the _id variable type to include strings fits in with this 
> philosophy because it permits the use of natural keys. But I don't 
> think that _id should be any type -- just Int32 or String (or maybe 
> just String).
>
> The Dapper model works well with many kinds of station data (both 
> time-series and profile data). It doesn't work well with trajectory 
> data. And I'm not convinced that Ted's satellite data (which is 
> similar to trajectories) would be a good fit, either. Both of these 
> cases could probably be shoehorned into the Dapper model. Is that a 
> good idea? Perhaps there should be a separate satellite swath 
> data/trajectory convention for these.
>
> - Joe
>
> John Caron wrote:
>> Hi All:
>>
>> Here are some specific comments on the spec:
>>
>> 1. "The inner sequence contains all of the measurement variables". 
>>     Generalization: It could be reasonable to allow measurements in 
>> the outer sequence, to save space and clarify invariants.
>>
>> 2. "The outer sequence must have an Int32 variable named _id and this 
>> variable must have a unique value for each entry of the outer sequence".
>>     Generalization: _id could be any type, including String.
>>
>> 3. "The outer sequence must have two variables that specify the x 
>> (longitude) axis and y (latitude) axis of the dataset. These 
>> variables are identified by the axis attribute and have values of "X" 
>> and "Y" respectively."     Clarity: ... have values of "lat" and 
>> "lon" respectively.
>>
>> 4. "The x,y, t, and z axes must have a Float32 or Float64 type."
>>    Generalization: t could alternately be an ISO 8601 formatted String.
>>
>> 5. "The sequence may have an attributes variable (of type Structure) 
>> that contains per-profile or per-time series metadata as a set of 
>> name-value pairs... All members of the structure must have a String 
>> type."
>>    - Are these handled in any special way? Is there a need to 
>> separate the measurements from the metadata? Otherwise, these could 
>> just be variables in the inner sequence.
>>   - Can there be other data in the outer sequence, eg a station id 
>> string? Is this attribute structure  a way to eliminate that?
>>
>> 6. "The sequence may have a variable_attributes variable of type 
>> Structure that contains per-profile or per-time series metadata for a 
>> specific data variable". The example, though, seems to describe 
>> variable metadata as a whole, not per-profile. In any case, I wonder 
>> if these can't just be variables in the inner sequence?
>>
>> 7.  Why do the inner sequence variables have to be floats?
>>
>> 8. How does the structure variable constrained_ranges differ from the 
>> global attributes lon_range, lat_range, etc?
>>
>> 9. max_profiles_per_request could be unlimited?
>>
>> 10. total_profiles_in_dataset could be unknown ?
>>
>> 11. "A server must support all selection constrains on coordinate 
>> variables."
>>     - Does this mean any combination of constraints, even very long 
>> and complex ones? A useful restriction might be just space/time 
>> bounding boxes.
>>     - on all coordinate variables? a useful restriction might be only 
>> on outer coordinate variables.
>>
>> 12. The DAS example implies that _id, lat, lon, time, depth can have 
>> missing values. It seems like one might want to disallow that?
>>   
>> A. Stepping back, this spec allows:
>>  - sets of variable length structures (i.e. 2D ragged arrays)
>>  - constraint selections allow queries that subset the dataset by 
>> space and time.
>>  - the _id allows a smart client to first get the ids of the inner 
>> sequences that have been subsetted by space and time, and then 
>> retrieve those inner sequences by id.
>>
>> B. One type of data we try to deal with is track/trajectory data 
>> where x, y, z, and t can all vary from observation to observation 
>> (e.g., data taken during an aircraft flight). So something like:
>>
>> netcdf trajectory_one.nc {
>> dimensions:
>>   traj = 5;
>>   time = 2848;   // (has coord.var)
>> variables:
>>   String traj(traj);
>>   double time(time, traj);
>>   double depth(time, traj);
>>   double latitude(time, traj);
>>   double longitude(time, traj);
>>
>>   double temperature(time, traj);
>> }
>>
>> A Dapper-ish view might look like:
>>
>> Dataset {
>>    Sequence {
>>        Int32 _id;
>>        Sequence {
>>            Float64 time;
>>            Float64 lat;
>>            Float64 lon;
>>            Float64 elev;
>>
>>            Float32 temp;
>>        } observation;
>>    } trajectory;
>>    ...
>> } my_trajectory_collection;
>>
>> In this case, the x,y,z, and t coordinates are all in the inner 
>> sequence. One could imagine other datasets that want some coordinates 
>> in the inner and some in the outer sequence.
>>
>> C. Providing all possible selections on a sequence is a real 
>> implementation problem. One could state it as: how do you let the 
>> client know what selections are efficient. It might be interesting to 
>> provide some metadata conventions that do that in the general case, eg:
>>
>> NC_GLOBAL {
>>   String selection_efficient "x y _id inner.depth";
>>   String selection_forbidden "inner.*";
>> }
>>
>> says "selections on x, y, _id, and inner.depth are efficient, 
>> anything else on inner sequence is forbidden, anything else is 
>> allowed but may be slow". Of course, clients would have to get smart 
>> to use that info.  This is equivalent to examining a database schema. 
>> In that sense, the info might belong in the DDS somehow. OTOH, the 
>> DAP-2 spec says that an opendap server must allow projections, so 
>> does that make a Dapper server non-compliant?
>>
>> D. BTW, we've been working on a netCDF-3 file convention for 
>> storing observation data, and we are studying if we can map these 
>> files into this spec. A lot of the problems have to do with 
>> efficiently implementing the projections. See
>>
>>  http://www.unidata.ucar.edu/software/netcdf-java/formats/UnidataObsConvention.html
>>
>> if you are interested.
>>
>> John and Ethan
>

-- 

Steve Hankin, NOAA/PMEL -- Steven.C.Hankin at noaa.gov
7600 Sand Point Way NE, Seattle, WA 98115-0070
ph. (206) 526-6080, FAX (206) 526-6744



More information about the Opendap-tech mailing list