codehaus


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: ElasticSearch adapter. Exposing meta fields (like _id, _uid etc.)


Let’s take an example of a simple query:

-- note _id fieldselect  _MAP['a'], _MAP['_id']  from elastic order by _MAP['c']

Logical plan is as follows:

ElasticsearchToEnumerableConverter
  ElasticsearchSort(sort0=[$2], dir0=[ASC])
    ElasticsearchProject(EXPR$0=[ITEM($0, 'a')], EXPR$1=[ITEM($0,
'_id')], EXPR$2=[ITEM($0, 'c')])
      ElasticsearchTableScan(table=[[elastic, foo]])

Questions

   1. In ElasticsearchEnumerators
   <https://github.com/asereda-gs/calcite/blob/elastic-idfield/elasticsearch/src/main/java/org/apache/calcite/adapter/elasticsearch/ElasticsearchEnumerators.java#L46>
   I need to know if requested field is a meta field (_id) or not (to
   correctly fetch it from payload). Does it mean one has to manually keep
   mapping between EXPR and ITEM somewhere ? Is there a better way ?
   2. In Sort Rel (ElasticsearchSort). How to know that sort0=[$2] (EXPR$2)
   is sort on ITEM($0, 'c') ? Mapping above (1) will help.

Thanks for your help.

On Sat, Dec 29, 2018 at 3:08 AM Stamatis Zampetakis <zabetak@xxxxxxxxx>
wrote:

> I think a concrete example of a query with its respective logical plan
> could clarify a few things. I would expect that the logical plan contains
> all the necessary information to decide which fields need to be fetched
> from Elastic without requiring additional aliases. If that information is
> lost when passing to the physical plan then we could possibly add some
> rules and/or operators to ensure this doesn't happen.
>
> Στις Παρ, 28 Δεκ 2018 στις 11:37 μ.μ., ο/η Andrei Sereda <andrei@xxxxxxxxx
> >
> έγραψε:
>
> > Other than that it seems to me _MAP[‘_id’] will become a RexCall similar
> to
> > ITEM($0, ‘_ID’) so it appears to me that you can retrieve the field name
> by
> > implementing a RexVisitor.
> >
> > Yes. That’s what I’m doing (see isItem
> > <
> >
> https://github.com/apache/calcite/pull/982/files#diff-d1419ede28af115c0b45d6e02828cb6fR85
> > >
> > )
> >
> > I don’t see clearly why you need to set explicitly aliases but probably
> is
> > because I am missing implementation details related with the adapter
> > itself.
> >
> > For physical plan I get a list of expressions / projections (via
> > RelDataType
> > / getRowType) but at some point I need to know if (for example) EXPR$0 ==
> > ITEM($0, '_id') (_id is a system field and needs to be fetched
> differently
> > from payload). RelDataType doesn’t have item information.
> >
> > That’s the only reason I’m manually saving this mapping. Perhaps there
> is a
> > better way to do it ?
> >
> > On Fri, Dec 28, 2018 at 2:08 AM Stamatis Zampetakis <zabetak@xxxxxxxxx>
> > wrote:
> >
> > > Hi Andrei,
> > >
> > > I tried to have a look in the PR but I am not familiar with the Elastic
> > > adapter code so I am missing a lot of context.
> > >
> > > In general, I think that relying on the on the field names coming from
> a
> > > RelDataType it is not safe. As you observed, it can get messed up quite
> > > easily from many parts of Calcite.
> > > It is usually better to try to reason using the position and/or the
> type
> > of
> > > the field rather than its name.
> > >
> > > Other than that it seems to me _MAP['_id'] will become a RexCall
> similar
> > to
> > > ITEM($0, '_ID') so it appears to me that you can retrieve the field
> name
> > by
> > > implementing a RexVisitor.
> > > I don't see clearly why you need to set explicitly aliases but probably
> > is
> > > because I am missing implementation details related with the adapter
> > > itself.
> > >
> > > Best,
> > > Stamatis
> > >
> > > Στις Πέμ, 27 Δεκ 2018 στις 8:36 μ.μ., ο/η Julian Hyde <
> jhyde@xxxxxxxxxx>
> > > έγραψε:
> > >
> > > > I added comments to the case. I did not look at the PR, sorry.
> > > >
> > > > > On Dec 26, 2018, at 1:56 PM, Andrei Sereda <andrei@xxxxxxxxx>
> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > Can somebody pls take a look at the following PR:
> > > > >
> > > > > PR: https://github.com/apache/calcite/pull/982
> > > > > JIRA: https://issues.apache.org/jira/browse/CALCITE-2755
> > > > >
> > > > > Specifically, I would like to understand if adding explicit mapping
> > > > > <
> > > >
> > >
> >
> https://github.com/apache/calcite/pull/982/files#diff-f24822f58a8062386da5287cd57cae65R75
> > > > >
> > > > > between expression (`EXPR$n`) and item (`_MAP['foo.bar']`) is a
> > correct
> > > > > (and recommended) approach. The reason I need it is because the
> > > original
> > > > > item name (eg. `_MAP['foo.bar']`) is lost after converter operation
> > (ie
> > > > I'm
> > > > > not able to derive it from `input.getRowType()`)
> > > > >
> > > > > ```sql
> > > > > -- Here I get EXPR$0 and EXPR$1 but no information about 'id' and
> > > > 'foo.bar'
> > > > > select _MAP['_id'], _MAP['foo.bar'] from elastic
> > > > >
> > > > > -- Query below preserves attribute information
> > > > > select * from (select _MAP['_id'] as \"_id\", _MAP['foo.bar'] as
> > > > > \"foo.bar\" from elastic)
> > > > > ```
> > > > >
> > > > > Many thanks,
> > > > > Andrei.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Dec 22, 2018 at 3:11 AM Stamatis Zampetakis <
> > zabetak@xxxxxxxxx
> > > >
> > > > > wrote:
> > > > >
> > > > >> Hi Andrei,
> > > > >>
> > > > >> Many data management systems have internal columns which the users
> > may
> > > > be
> > > > >> able to query (see for example Postgres [1]). It doesn't sound
> > > > unnatural to
> > > > >> allow users access these fields. On the other hand such system
> > columns
> > > > do
> > > > >> not appear in SELECT * queries and I suppose they are pruned out
> by
> > > > >> projections early on if they are not used by the query. I guess
> you
> > > > could
> > > > >> take a similar approach in the ElasticSearch adapter.
> > > > >>
> > > > >> Best,
> > > > >> Stamatis
> > > > >>
> > > > >> [1]
> https://www.postgresql.org/docs/current/ddl-system-columns.html
> > > > >>
> > > > >>
> > > > >> Στις Σάβ, 22 Δεκ 2018 στις 1:14 π.μ., ο/η Andrei Sereda
> > > > <andrei@xxxxxxxxx>
> > > > >> έγραψε:
> > > > >>
> > > > >>> Hello,
> > > > >>>
> > > > >>> Upon indexing, elastic allocates a unique _id
> > > > >>> <
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-id-field.html
> > > > >>>>
> > > > >>> for each document (unless specified by the user). Currently when
> > > > mapping
> > > > >>> between elastic result and calcite return type we query only
> > _source
> > > or
> > > > >>> fields attributes. _id is not part those but exposed at higher
> > level
> > > in
> > > > >>> query response (see SearchHit
> > > > >>> <
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/6.1/_the_search_api.html
> > > > >>>>
> > > > >>> ).
> > > > >>>
> > > > >>> Currently  _MAP['foo'] becomes _source.get('foo') (or
> > > > fields.get('foo')).
> > > > >>>
> > > > >>> Do you think we should make _MAP['_id'] available implicitly ?
> > > > >>>
> > > > >>> I have a couple of use-cases where one needs to know document ID.
> > > > >>>
> > > > >>> Please note this makes sense only for non-aggregate queries.
> > > > >>>
> > > > >>> Regards,
> > > > >>>
> > > > >>> Andrei.
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
>