python3, regular expression and bytes text


On Sun, Oct 13, 2019 at 7:16 AM Richard Damon <Richard at damon-family.org> wrote:
>
> On 10/12/19 3:46 PM, Eko palypse wrote:
> > Thank you very much for your answer.
> >
> >> You have to be able to match bytes, not strings.
> > May I ask you to elaborate on this? Sorry, I'm a non-native English speaker.
> > The buffer I receive is a byte-like buffer.
> >
> >> I don't think you'll be able to 100% reliably match bytes in this way.
> >> You're asking it to make analysis of multiple bytes and to interpret
> >> them according to which character they would represent if decoded from
> >> UTF-8.
> >>
> >> My recommendation: Even if your buffer is multiple gigabytes, just
> >> decode it anyway. Maybe you can decode your buffer in chunks, but
> >> otherwise, just bite the bullet and do the decode. You may be
> >> pleasantly surprised at how little you suffer as a result; Python is
> >> quite decent at memory management, and even if you DO get pushed into
> >> the swapper by this, it's still likely to be faster than trying to
> >> code around all the possible problems that come from mismatching your
> >> text search.
> >>
> >> ChrisA
> > That's what I was afraid of.
> > It would be nice if the "world" could commit itself to one standard,
> > but I'm afraid that won't happen in my life anymore, I guess. :-(
> >
> > Thx
> > Eren
>
> Current 'best practices' are, in my opinion, to convert data (if needed)
> to some version of Unicode (UTF-8, UTF-16, or UCS-4) at input and
> process in that domain.

Specifically, convert to abstract Unicode text, not to any of those
byte encodings.
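
A rough sketch of what that looks like in practice, assuming the buffer
is UTF-8 and arrives in pieces. The helper name decode_chunks and the
sample data are made up for illustration; codecs.getincrementaldecoder
and re.findall are standard library calls.

```python
import codecs
import re


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of bytes chunks into a single str.

    The incremental decoder holds back a multi-byte sequence that is
    split across a chunk boundary instead of raising an error.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # Flush the decoder; this raises if the input ended mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Stand-in for the received byte buffer (escapes used so the bytes are unambiguous).
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Split in the middle of the two-byte encoding of "\u00ef".
chunks = [raw[:3], raw[3:]]
text = decode_chunks(chunks)

# Bytes pattern: \w only matches ASCII word characters, so the encoded
# non-ASCII letters break the words apart.
byte_words = re.findall(rb"\w+", raw)

# Text pattern: \w matches Unicode letters, so whole words come back.
text_words = re.findall(r"\w+", text)

print(byte_words)
print(text_words)
```

The bytes search silently cuts words at every non-ASCII letter, which is
the kind of mismatch that decoding first avoids.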

> You do need to be prepared to run
> into files which are encoded in some locally defined 8-bit code page. In
> Python 3, strings are Unicode encoded, and you don't need to worry about
> the details of which encoding is used internally; Python will deal with
> that itself.
>

Yes. A Python 3 string consists of Unicode characters (technically,
it's a sequence of Unicode code points), and you don't have to worry
about encodings any more than you need to worry about the details of
the IEEE 754 packed formats in order to use floating-point numbers.
You should be able to just use the abstract "text string" as a
fundamental concept.
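
To make the distinction concrete, here is a small sketch (the sample
string is arbitrary). Lengths and indexes on a str count code points,
whatever byte encoding you later pick for output:

```python
s = "caf\u00e9"  # four code points; the last one is U+00E9

utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")

# The str has one length; each encoding of it has its own byte length.
str_len = len(s)
utf8_len = len(utf8)
utf16_len = len(utf16)

# Indexing a str yields a code point, never a fragment of an encoding.
last = s[3]
code_points = [hex(ord(ch)) for ch in s]

# Decoding either byte form gives back the same abstract text.
same_text = utf8.decode("utf-8") == utf16.decode("utf-16-le") == s

print(str_len, utf8_len, utf16_len)
print(code_points)
print(same_text)
```

The byte counts differ between encodings, but the text itself does not,
which is why the encoding only matters at the input and output edges.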

ChrisA