Re: Relational algebra and signal processing
I think the difficulty with JulianF’s signal processing domain is that he needs there to be precisely one record at every clock tick (or more generally, at every point in an N-dimensional discrete space).
Consider stock trading. A stock trade is an event that happens in continuous time, say
(9:58:02 ORCL 41), (10:01:55 ORCL 43)
Our query wants to know the stock price at 10:00 (or at any 1-minute interval). Therefore we have to convert the event-oriented data into an array:
(9:59 ORCL 41), (10:00 ORCL 41), (10:01 ORCL 41), (10:02 ORCL 43).
JulianF’s domain may be more naturally in the realm of array databases, but there are a lot of advantages to relational algebra and SQL, not least that we have a reasonable story for streaming data, so let’s try to bridge the gap. Suppose we add a FILL operator that converts an event-based relation into a dense array:
FROM TABLE(FILL(Trades, 'ROWTIME', INTERVAL '1' MINUTE))
Now we can safely join with other data at the same granularity.
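To make the intended semantics concrete, here is a minimal Python sketch of what such a FILL might compute. It assumes "carry the last value forward" semantics and a single symbol. The names `fill` and `t`, and the calendar day used for parsing, are hypothetical and only serve the illustration; this is not a statement of how Calcite would implement the operator.

```python
from datetime import datetime, timedelta

def fill(events, start, end, step):
    """Return one (tick, symbol, price) row per tick in [start, end].

    Hypothetical helper: `events` is a list of (timestamp, symbol, price)
    tuples sorted by timestamp, all for one symbol. Each tick carries the
    most recent event at or before it. Ticks before the first event are
    skipped because there is no value to carry forward yet.
    """
    rows = []
    i = 0
    last = None
    tick = start
    while tick <= end:
        # Consume every event that happened at or before this tick.
        while i < len(events) and events[i][0] <= tick:
            last = events[i]
            i += 1
        if last is not None:
            rows.append((tick, last[1], last[2]))
        tick += step
    return rows

def t(hms):
    # Parse "HH:MM:SS" on a fixed day (the day itself is an assumption).
    return datetime.strptime("2018-12-18 " + hms, "%Y-%m-%d %H:%M:%S")

trades = [(t("09:58:02"), "ORCL", 41), (t("10:01:55"), "ORCL", 43)]
dense = fill(trades, t("09:59:00"), t("10:02:00"), timedelta(minutes=1))
for tick, symbol, price in dense:
    print(tick.strftime("%H:%M"), symbol, price)
```

Run on the two ORCL trades above, this yields the same four rows as the array in the example (41 at 9:59, 10:00 and 10:01, then 43 at 10:02). Because every tick has exactly one row, a join on the tick column against another relation filled at the same interval is an ordinary equi-join.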
Is this a step in the right direction?
> On Dec 18, 2018, at 7:05 AM, Michael Mior <mmior@xxxxxxxxxx> wrote:
> I would say a similar theory applies. Some things are different when you're
> dealing with streams. Mainly joins and aggregations. Semantics are
> necessarily different whenever you have operations involving more than one
> row at a time from the input stream. When dealing with a relation an
> aggregation is straightforward since you just consume the entire input, and
> output the result of the aggregation. Since streams don't end, you need to
> decide how this is handled which usually amounts to a choice of windowing
> algorithm. There are a few other things to think about. The presentation
> linked below from Julian Hyde has a nice overview
> Michael Mior
> On Tue, Dec 18, 2018 at 02:28, Julian Feinauer <j.feinauer@xxxxxxxxxxxxxxxxx>
> wrote:
>> Hi Michael,
>> yes, our workloads are usually in the context of streaming (but for replay
>> or so we also use batch).
>> But, if I understand it correctly, the same theory applies to both tables
>> ("relations") and streaming tables, right?
>> I hope to find time soon to write a PLC4X - Calcite source which creates
>> one or many streams based on readings from a PLC.
>> On 18.12.18, 03:19, "Michael Mior" <mmior@xxxxxxxxxx> wrote:
>> Perhaps you've thought of this already, but it sounds like streaming
>> relational algebra could be a good fit here.
>> Michael Mior
>> On Sun, Dec 16, 2018 at 18:39, Julian Feinauer <
>> wrote:
>>> Hi Calcite-devs,
>>> I just had a very interesting mail exchange with Julian (Hyde) on the
>>> incubator list. It was about our project CRUNCH (which is mostly
>>> time series analysis and signal processing) and its relation to
>>> relational algebra, and I wanted to bring the discussion to this list
>>> to continue it here.
>>> We already had some discussion about how time series would work in
>>> Calcite, and it’s closely related to MATCH_RECOGNIZE.
>>> But I have a more general question in mind to ask the experts here on
>>> the list.
>>> I ask myself if we can see the signal processing and analysis tasks as
>>> a proper application of relational algebra.
>>> Disclaimer: I’m a mathematician, so I know the formalism of
>>> (relational) algebra pretty well, but I’m lacking a lot of experience
>>> and knowledge in database theory (most of my knowledge there comes
>>> from Calcite’s code and the book by Garcia-Molina and Ullman).
>>> So if we take, for example, a stream of signals from a sensor, then we
>>> can of course do filtering or smoothing on it, and this can be seen as
>>> a mapping between the input relation and the output relation. But as
>>> we usually need more than just one tuple at a time, we lose many of
>>> the advantages of relational theory. And then, if we analyze the
>>> signal, we can again see it as a mapping between relations, but the
>>> input relation is a “time series” and the output relation consists of
>>> “events”, so these are in some way different dimensions. In this
>>> situation it becomes most obvious where the main differences between
>>> time series and relational algebra are.
>>> Think of something simple: an event should be registered whenever the
>>> signal switches from FALSE to TRUE (so not for every TRUE). This could
>>> also be modelled with MATCH_RECOGNIZE pretty easily. But for me it
>>> feels “unnatural” because we cannot use any indices (we don’t care
>>> about the ratio of TRUE and FALSE in the DB, except for probably some
>>> very outer bounds). And we are lacking the “right” information for the
>>> optimizer, like estimations on the number of analysis results.
>>> It gets even more complicated when moving to continuous valued signals
>>> (INT, DOUBLE, …), e.g., temperature readings or something.
>>> If we want to analyze the number of times where we have a temperature
>>> change of more than 5 degrees in under 4 hours, this should also be
>>> possible with MATCH_RECOGNIZE, but again, there is no index to help us
>>> and we have no information for the optimizer, so it feels very “black
>>> box” for the relational algebra.
>>> I’m not sure if you get my point, but for me, the elegance of
>>> relational algebra was always this optimization stuff, which comes
>>> from [...] and ends in an “optimal” physical plan. And I do not see
>>> how we can use much of this for the examples given above.
>>> Perhaps one solution would be to do the same as for spatial queries
>>> (or the JSON / JSONB support in postgres) and add specialized
>>> statistics and optimizer rules. This would make it more “relational
>>> algebra”-esque, in the sense that there really is a possibility to
>>> apply transformations to a given query.
>>> What do you think? Do I see things too complicated or am I missing
>>> something?
>>>  https://www.postgresql.org/docs/9.4/datatype-json.html