
[tripleo][ci] container pulls failing


On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz at redhat.com> wrote:

> On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin at redhat.com>
> wrote:
> >
> >
> >
> > On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli at redhat.com>
> wrote:
> >>
> >> On 7/28/20 6:09 PM, Wesley Hayutin wrote:
> >> >
> >> >
> >> > On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien at redhat.com> wrote:
> >> >
> >> >
> >> >
> >> >     On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz at redhat.com> wrote:
> >> >
> >> >         On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien at redhat.com> wrote:
> >> >          >
> >> >          >
> >> >          >
> >> >          > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin at redhat.com> wrote:
> >> >          >>
> >> >          >> FYI...
> >> >          >>
> >> >          >> If you find your jobs are failing with an error similar to
> >> >          >> [1], you have been rate limited by docker.io via the upstream
> >> >          >> mirror system and have hit [2].  I've been discussing the issue
> >> >          >> w/ upstream infra, rdo-infra and a few CI engineers.
> >> >          >>
> >> >          >> There are a few ways to mitigate the issue; however, I don't
> >> >          >> see any of the options being completed very quickly, so I'm
> >> >          >> asking for your patience while this issue is socialized and
> >> >          >> resolved.
> >> >          >>
> >> >          >> For full transparency we're considering the following options.
> >> >          >>
> >> >          >> 1. move off of docker.io to quay.io
> >> >          >
> >> >          >
> >> >          > quay.io also has an API rate limit:
> >> >          > https://docs.quay.io/issues/429.html
> >> >          >
> >> >          > Now I'm not sure about how many requests per second one can
> >> >          > do vs the other, but this would need to be checked with the quay
> >> >          > team before changing anything.
> >> >          > Also quay.io has had its big downtimes as well, so the SLA
> >> >          > needs to be considered.
> >> >          >
> >> >          >> 2. local container builds for each job in master, possibly ussuri
> >> >          >
> >> >          >
> >> >          > Not convinced.
> >> >          > You can look at CI logs:
> >> >          > - pulling / updating / pushing container images from
> >> >          >   docker.io to local registry takes ~10 min on standalone (OVH)
> >> >          > - building containers from scratch with updated repos and
> >> >          >   pushing them to local registry takes ~29 min on standalone (OVH).
> >> >          >
> >> >          >>
> >> >          >> 3. parent/child jobs upstream, where rpms and containers will
> >> >          >> be built and hosted as artifacts for the child jobs
> >> >          >
> >> >          >
> >> >          > Yes, we need to investigate that.
> >> >          >
> >> >          >>
> >> >          >> 4. remove some portion of the upstream jobs to lower the
> >> >          >> impact we have on 3rd party infrastructure.
> >> >          >
> >> >          >
> >> >          > I'm not sure I understand this one, maybe you can give an
> >> >          > example of what could be removed?
> >> >
> >> >         We need to re-evaluate our use of scenarios (e.g. we have two
> >> >         scenario010's and both are non-voting).  There's a reason we
> >> >         historically didn't want to add more jobs: these types of
> >> >         resource constraints.  I think we've added new jobs recently and
> >> >         likely need to reduce what we run. Additionally we might want to
> >> >         look into reducing what we run on stable branches as well.
> >> >
> >> >
> >> >     Oh... removing jobs (I thought we would remove some steps of the jobs).
> >> >     Yes big +1, this should be a continuous goal when working on CI, and
> >> >     always evaluating what we need vs what we run now.
> >> >
> >> >     We should look at:
> >> >     1) services deployed in scenarios that aren't worth testing (e.g.
> >> >     deprecated or unused things) (and deprecate the unused things)
> >> >     2) jobs themselves (I don't have any example besides scenario010,
> >> >     but I'm sure there are more).
> >> >     --
> >> >     Emilien Macchi
> >> >
> >> >
> >> > Thanks Alex, Emilien
> >> >
> >> > +1 to reviewing the catalog and adjusting things on an ongoing basis.
> >> >
> >> > All.. it looks like the issues with docker.io were more of a flare up
> >> > than a change in docker.io policy or infrastructure [2].  The flare up
> >> > started on July 27 at 08:00 UTC and ended on July 27 at 17:00 UTC, see
> >> > screenshots.
> >>
> >> The number of image prepare workers and their exponential backoff
> >> intervals should also be adjusted. I've analysed the log snippet [0] for
> >> the connection reset counts by worker versus the times the rate
> >> limiting was triggered. See the details in the reported bug [1].
> >>
> >> tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
> >>
> >> Conn Reset Counts by a Worker PID:
> >>        3 58412
> >>        2 58413
> >>        3 58415
> >>        3 58417
> >>
> >> which seems like too many (workers * reconnects) and triggers rate limiting
> >> immediately.
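
A minimal sketch of how a per-worker tally like the one above could be produced from a log. The line layout ("<date> <time>,<ms> <pid> <LEVEL> <message>"), the "Connection reset" marker text, and the sample lines are all assumptions made for illustration; they are not taken from the actual tripleo-container-image-prepare.log.

```python
import re
from collections import Counter

# Assumed (hypothetical) line layout:
#   "<date> <time>,<ms> <pid> <LEVEL> <message>"
LINE_RE = re.compile(r"^\S+ (?P<time>\d{2}:\d{2}:\d{2}),\d{3} (?P<pid>\d+) ")


def count_resets(lines, start, end, needle="Connection reset"):
    """Count lines containing `needle` per worker PID.

    `start` and `end` are inclusive "HH:MM:SS" strings; zero-padded
    times compare correctly as plain strings.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and needle in line and start <= match.group("time") <= end:
            counts[match.group("pid")] += 1
    return counts


# Made-up sample lines, only to exercise the function.
SAMPLE_LINES = [
    "2020-01-01 03:55:31,379 58412 WARNING Connection reset by peer",
    "2020-01-01 03:55:32,001 58413 WARNING Connection reset by peer",
    "2020-01-01 03:55:33,500 58412 WARNING Connection reset by peer",
    "2020-01-01 03:55:34,000 58415 INFO Pulled layer",
    "2020-01-01 03:56:10,000 58417 WARNING Connection reset by peer",
]

counts = count_resets(SAMPLE_LINES, "03:55:31", "03:55:36")
for pid, n in sorted(counts.items()):
    print(f"{n:8d} {pid}")
```

The window is matched at one-second resolution to keep the sketch short; a real analysis would parse the millisecond field as well.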
> >>
> >> [0] https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcdn.com/741228/6/check/tripleo-ci-centos-8-undercloud-containers/8e47836/logs/undercloud/var/log/tripleo-container-image-prepare.log
> >>
> >> [1] https://bugs.launchpad.net/tripleo/+bug/1889372
> >>
> >> --
> >> Best regards,
> >> Bogdan Dobrelya,
> >> Irc #bogdando
> >>
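
To illustrate the point about workers and retry intervals: when several workers all retry on a fixed schedule, their reconnects arrive in bursts. One common mitigation is capped exponential backoff with random jitter, sketched below. This is a generic illustration, not the image prepare code; the function names, defaults, and the `fetch` callable are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(fetch, max_attempts=5, base=1.0, cap=60.0,
          sleep=time.sleep, rng=random.random):
    """Call `fetch` until it succeeds or `max_attempts` is exhausted.

    Only ConnectionError (which includes ConnectionResetError) is retried;
    the last failure is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

Because each worker draws its own random delay, retries from N workers spread out over the backoff window instead of landing together, which lowers the peak (workers * reconnects) rate seen by the registry. Reducing the worker count is a separate knob and is not shown here.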
> >
> > FYI..
> >
> > The issue w/ "too many requests" is back.  Expect delays and failures in
> > attempting to merge your patches upstream across all branches.  The issue
> > is being tracked as a critical issue.
>
> Working with the infra folks, we have identified the authorization
> header as causing issues when we're redirected from docker.io to
> cloudflare. I'll throw up a patch tomorrow to handle this case which
> should improve our usage of the cache.  It needs some testing against
> other registries to ensure that we don't break authenticated fetching
> of resources.
>
> Thanks Alex!
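
A rough sketch of the idea described above, dropping the authorization header when a request is redirected to a different host. This is not the actual patch, which is not shown in this thread; the function name and the example URLs are hypothetical, and a real fix would still need the testing against other registries mentioned above.

```python
from urllib.parse import urlsplit


def headers_for_redirect(headers, original_url, redirect_url):
    """Return a copy of `headers` suitable for following a redirect.

    The Authorization header is kept only when the redirect stays on the
    same host; header names are matched case-insensitively.
    """
    same_host = (urlsplit(original_url).hostname
                 == urlsplit(redirect_url).hostname)
    if same_host:
        return dict(headers)
    return {name: value for name, value in headers.items()
            if name.lower() != "authorization"}


# Hypothetical URLs, for illustration only.
request_headers = {"Authorization": "Bearer <token>", "Accept": "*/*"}
forwarded = headers_for_redirect(
    request_headers,
    "https://registry.example/v2/some/image/blobs/sha256:abc",
    "https://cdn.example/blobs/sha256:abc",
)
print(forwarded)
```

Comparing hostnames is the simplest possible rule; whether a given registry expects the token to survive a redirect is exactly the kind of case that needs checking before relying on it.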