
[tripleo][ci] container pulls failing


On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien at redhat.com> wrote:

>
>
> On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz at redhat.com> wrote:
>
>> On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien at redhat.com>
>> wrote:
>> >
>> >
>> >
>> > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin at redhat.com>
>> wrote:
>> >>
>> >> FYI...
>> >>
>> >> If you find your jobs are failing with an error similar to [1], you
>> have been rate limited by docker.io via the upstream mirror system and
>> have hit [2].  I've been discussing the issue w/ upstream infra, rdo-infra
>> and a few CI engineers.
>> >>
>> >> There are a few ways to mitigate the issue however I don't see any of
>> the options being completed very quickly so I'm asking for your patience
>> while this issue is socialized and resolved.
>> >>
>> >> For full transparency we're considering the following options.
>> >>
>> >> 1. move off of docker.io to quay.io
>> >
>> >
>> > quay.io also has API rate limit:
>> > https://docs.quay.io/issues/429.html
>> >
>> > Now I'm not sure how many requests per second one can do on one vs the
>> other, but this would need to be checked with the quay team before
>> changing anything.
>> > Also, quay.io has had its big downtimes as well; the SLA needs to be
>> considered.
>> >
>> >> 2. local container builds for each job in master, possibly ussuri
>> >
>> >
>> > Not convinced.
>> > You can look at CI logs:
>> > - pulling / updating / pushing container images from docker.io to
>> local registry takes ~10 min on standalone (OVH)
>> > - building containers from scratch with updated repos and pushing them
>> to local registry takes ~29 min on standalone (OVH).
>> >
>> >>
>> >> 3. parent/child jobs upstream, where rpms and containers will be built
>> and hosted as artifacts for the child jobs
>> >
>> >
>> > Yes, we need to investigate that.
>> >
>> >>
>> >> 4. remove some portion of the upstream jobs to lower the impact we
>> have on 3rd party infrastructure.
>> >
>> >
>> > I'm not sure I understand this one, maybe you can give an example of
>> what could be removed?
>>
>> We need to re-evaluate our use of scenarios (e.g. we have two
>> scenario010's both are non-voting).  There's a reason we historically
>> didn't want to add more jobs because of these types of resource
>> constraints.  I think we've added new jobs recently and likely need to
>> reduce what we run. Additionally we might want to look into reducing
>> what we run on stable branches as well.
>>
>
> Oh... removing jobs (I thought we would remove some steps of the jobs).
> Yes big +1, this should be a continuous goal when working on CI, and
> always evaluating what we need vs what we run now.
>
> We should look at:
> 1) services deployed in scenarios that aren't worth testing (e.g.
> deprecated or unused things) (and deprecate the unused things)
> 2) jobs themselves (I don't have any example beside scenario010 but I'm
> sure there are more).
> --
> Emilien Macchi
>

Thanks Alex, Emilien

+1 to reviewing the catalog and adjusting things on an ongoing basis.

All.. it looks like the issues with docker.io were more of a flare-up than
a change in docker.io policy or infrastructure [2].  The flare-up started
on July 27 at 08:00 UTC and ended on July 27 at 17:00 UTC; see the
attached screenshots.

I've socialized the issue with the CI team, along with some ways to reduce
our reliance on docker.io or any public registry.  Sagi and I have a draft
design that we'll share on this list after a first round of a POC.  We also
thought we'd leverage Emilien's awesome work [1] to build containers
locally in standalone jobs to reduce our traffic to docker.io and the
upstream proxies.

TLDR, feel free to recheck and wf.  Thanks for your patience!!

[1] https://review.opendev.org/#/q/status:open++topic:dos_docker.io
[2] link to logstash query be sure to change the time range
<http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests%20for%20url%3A%5C%22%20AND%20voting%3A1>
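For anyone scripting around the 429s in the meantime: a recheck often
succeeds simply because the rate-limit window has passed, and the same idea
can be automated with backoff.  A minimal sketch (hypothetical helper
names, not anything that exists in tripleo-ci) of exponential backoff
around a pull operation that raises on 429:

```python
import time


class RateLimited(Exception):
    """Raised when a registry answers 429 Too Many Requests."""


def retry_on_429(pull, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `pull` until it succeeds, backing off exponentially on 429s.

    `pull` is any zero-argument callable (e.g. a wrapper around an image
    pull) that raises RateLimited when the registry throttles us.  The
    delay doubles each attempt: base_delay, 2*base_delay, 4*base_delay...
    """
    for attempt in range(max_attempts):
        try:
            return pull()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the 429 to the caller
            sleep(base_delay * (2 ** attempt))
```

This is only a sketch of the retry pattern; the real mitigation work
(local builds, parent/child jobs) is what's being designed above.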
-------------- next part --------------
A non-text attachment was scrubbed...
Name: docker.io_2.png
Type: image/png
Size: 93167 bytes
Desc: not available
URL: <http://lists.openstack.org/pipermail/openstack-discuss/attachments/20200728/c7def987/attachment-0001.png>