codehaus


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: ElasticSearch adapter. Exposing meta fields (like _id, _uid etc.)


I think a concrete example of a query with its respective logical plan
could clarify a few things. I would expect that the logical plan contains
all the necessary information to decide which fields need to be fetched
from Elastic without requiring additional aliases. If that information is
lost when passing to the physical plan then we could possibly add some
rules and/or operators to ensure this doesn't happen.

Στις Παρ, 28 Δεκ 2018 στις 11:37 μ.μ., ο/η Andrei Sereda <andrei@xxxxxxxxx>
έγραψε:

> Other than that it seems to me _MAP[‘_id’] will become a RexCall similar to
> ITEM($0, ‘_ID’) so it appears to me that you can retrieve the field name by
> implementing a RexVisitor.
>
> Yes. That’s what I’m doing (see isItem
> <
> https://github.com/apache/calcite/pull/982/files#diff-d1419ede28af115c0b45d6e02828cb6fR85
> >
> )
>
> I don’t see clearly why you need to set explicitly aliases but probably is
> because I am missing implementation details related with the adapter
> itself.
>
> For physical plan I get a list of expressions / projections (via
> RelDataType
> / getRowType) but at some point I need to know if (for example) EXPR$0 ==
> ITEM($0, '_id') (_id is a system field and needs to be fetched differently
> from payload). RelDataType doesn’t have item information.
>
> That’s the only reason I’m manually saving this mapping. Perhaps there is a
> better way to do it ?
>
> On Fri, Dec 28, 2018 at 2:08 AM Stamatis Zampetakis <zabetak@xxxxxxxxx>
> wrote:
>
> > Hi Andrei,
> >
> > I tried to have a look in the PR but I am not familiar with the Elastic
> > adapter code so I am missing a lot of context.
> >
> > In general, I think that relying on the on the field names coming from a
> > RelDataType it is not safe. As you observed, it can get messed up quite
> > easily from many parts of Calcite.
> > It is usually better to try to reason using the position and/or the type
> of
> > the field rather than its name.
> >
> > Other than that it seems to me _MAP['_id'] will become a RexCall similar
> to
> > ITEM($0, '_ID') so it appears to me that you can retrieve the field name
> by
> > implementing a RexVisitor.
> > I don't see clearly why you need to set explicitly aliases but probably
> is
> > because I am missing implementation details related with the adapter
> > itself.
> >
> > Best,
> > Stamatis
> >
> > Στις Πέμ, 27 Δεκ 2018 στις 8:36 μ.μ., ο/η Julian Hyde <jhyde@xxxxxxxxxx>
> > έγραψε:
> >
> > > I added comments to the case. I did not look at the PR, sorry.
> > >
> > > > On Dec 26, 2018, at 1:56 PM, Andrei Sereda <andrei@xxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > Can somebody pls take a look at the following PR:
> > > >
> > > > PR: https://github.com/apache/calcite/pull/982
> > > > JIRA: https://issues.apache.org/jira/browse/CALCITE-2755
> > > >
> > > > Specifically, I would like to understand if adding explicit mapping
> > > > <
> > >
> >
> https://github.com/apache/calcite/pull/982/files#diff-f24822f58a8062386da5287cd57cae65R75
> > > >
> > > > between expression (`EXPR$n`) and item (`_MAP['foo.bar']`) is a
> correct
> > > > (and recommended) approach. The reason I need it is because the
> > original
> > > > item name (eg. `_MAP['foo.bar']`) is lost after converter operation
> (ie
> > > I'm
> > > > not able to derive it from `input.getRowType()`)
> > > >
> > > > ```sql
> > > > -- Here I get EXPR$0 and EXPR$1 but no information about 'id' and
> > > 'foo.bar'
> > > > select _MAP['_id'], _MAP['foo.bar'] from elastic
> > > >
> > > > -- Query below preserves attribute information
> > > > select * from (select _MAP['_id'] as \"_id\", _MAP['foo.bar'] as
> > > > \"foo.bar\" from elastic)
> > > > ```
> > > >
> > > > Many thanks,
> > > > Andrei.
> > > >
> > > >
> > > >
> > > >
> > > > On Sat, Dec 22, 2018 at 3:11 AM Stamatis Zampetakis <
> zabetak@xxxxxxxxx
> > >
> > > > wrote:
> > > >
> > > >> Hi Andrei,
> > > >>
> > > >> Many data management systems have internal columns which the users
> may
> > > be
> > > >> able to query (see for example Postgres [1]). It doesn't sound
> > > unnatural to
> > > >> allow users access these fields. On the other hand such system
> columns
> > > do
> > > >> not appear in SELECT * queries and I suppose they are pruned out by
> > > >> projections early on if they are not used by the query. I guess you
> > > could
> > > >> take a similar approach in the ElasticSearch adapter.
> > > >>
> > > >> Best,
> > > >> Stamatis
> > > >>
> > > >> [1] https://www.postgresql.org/docs/current/ddl-system-columns.html
> > > >>
> > > >>
> > > >> Στις Σάβ, 22 Δεκ 2018 στις 1:14 π.μ., ο/η Andrei Sereda
> > > <andrei@xxxxxxxxx>
> > > >> έγραψε:
> > > >>
> > > >>> Hello,
> > > >>>
> > > >>> Upon indexing, elastic allocates a unique _id
> > > >>> <
> > > >>>
> > > >>
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-id-field.html
> > > >>>>
> > > >>> for each document (unless specified by the user). Currently when
> > > mapping
> > > >>> between elastic result and calcite return type we query only
> _source
> > or
> > > >>> fields attributes. _id is not part those but exposed at higher
> level
> > in
> > > >>> query response (see SearchHit
> > > >>> <
> > > >>>
> > > >>
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/6.1/_the_search_api.html
> > > >>>>
> > > >>> ).
> > > >>>
> > > >>> Currently  _MAP['foo'] becomes _source.get('foo') (or
> > > fields.get('foo')).
> > > >>>
> > > >>> Do you think we should make _MAP['_id'] available implicitly ?
> > > >>>
> > > >>> I have a couple of use-cases where one needs to know document ID.
> > > >>>
> > > >>> Please note this makes sense only for non-aggregate queries.
> > > >>>
> > > >>> Regards,
> > > >>>
> > > >>> Andrei.
> > > >>>
> > > >>
> > >
> > >
> >
>