Re: Row format or bulk format


Hi Taher,

The row format appends data to part files record by record. When you implement the Encoder interface, you decide how to serialise each record into bytes and write those bytes to the OutputStream of the current part file. At the moment I do not see any utility classes in the Flink code base that implement the Encoder interface for writing Avro records, but I think it can be implemented using the standard Avro utilities.
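To make the record-by-record idea concrete, here is a minimal sketch. It does not use Flink or Avro: RowEncoder is a hypothetical local stand-in with the same shape as Flink's Encoder interface (one encode call per record, given the part file's OutputStream), and the records are plain strings written one per line. The names RowEncoder, LineEncoder and RowFormatDemo are made up for illustration, and the in-memory stream only simulates a part file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical stand-in with the same shape as Flink's Encoder interface,
// declared locally so the sketch compiles without Flink on the classpath.
interface RowEncoder<T> {
    void encode(T element, OutputStream stream) throws IOException;
}

// Serialises each record as one UTF-8 line. An Avro version would turn the
// record into bytes with Avro's own writers instead (assumption, not shown).
class LineEncoder implements RowEncoder<String> {
    @Override
    public void encode(String element, OutputStream stream) throws IOException {
        stream.write(element.getBytes(StandardCharsets.UTF_8));
        stream.write('\n');
    }
}

class RowFormatDemo {
    // Simulates a part file with an in-memory stream. The encoder is called
    // once per record, so bytes reach the stream record by record.
    static byte[] writePart(List<String> records) {
        ByteArrayOutputStream partFile = new ByteArrayOutputStream();
        RowEncoder<String> encoder = new LineEncoder();
        try {
            for (String record : records) {
                encoder.encode(record, partFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return partFile.toByteArray();
    }
}
```

With real Flink classes, an implementation of Encoder would be handed to the sink in place of LineEncoder; the per-record logic stays the same.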

The bulk format serialises and appends records to part files in batches. When you implement the BulkWriter.Factory and BulkWriter interfaces, you decide what to do with each record, when to finish the batch, and when to write it to the OutputStream of the current part file. Records that are still buffered should be written out early if the BulkWriter.flush method is called, e.g. when the part file is finished. Flink already ships BulkWriters for writing Avro records in Parquet format, provided by the ParquetAvroWriters class.
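The batching behaviour can be sketched in the same spirit. Again this is not Flink code: BatchWriter and BatchWriter.Factory are hypothetical local stand-ins modelled on the addElement / flush / finish methods of BulkWriter and the create method of BulkWriter.Factory, with a plain OutputStream in place of Flink's stream type. The "batch:N" header line and the batch size are made up purely to make the batch boundaries visible.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins modelled on Flink's BulkWriter and BulkWriter.Factory.
interface BatchWriter<T> {
    void addElement(T element) throws IOException;
    void flush() throws IOException;
    void finish() throws IOException;

    interface Factory<T> {
        BatchWriter<T> create(OutputStream out) throws IOException;
    }
}

// Buffers records and writes them as one batch once batchSize is reached,
// or earlier when flush() or finish() is called.
class BufferingLineWriter implements BatchWriter<String> {
    private final OutputStream out;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BufferingLineWriter(OutputStream out, int batchSize) {
        this.out = out;
        this.batchSize = batchSize;
    }

    @Override
    public void addElement(String element) throws IOException {
        buffer.add(element);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    @Override
    public void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        // Made-up batch layout: a header line with the record count, then the records.
        StringBuilder batch = new StringBuilder("batch:" + buffer.size() + "\n");
        for (String record : buffer) {
            batch.append(record).append('\n');
        }
        out.write(batch.toString().getBytes(StandardCharsets.UTF_8));
        out.flush();
        buffer.clear();
    }

    @Override
    public void finish() throws IOException {
        // Writes whatever is still buffered; the stream itself is left open.
        flush();
    }
}

class BulkFormatDemo {
    // Simulates writing one part file and returns its content as text.
    static String writePart(List<String> records, int batchSize) {
        ByteArrayOutputStream partFile = new ByteArrayOutputStream();
        BatchWriter.Factory<String> factory = out -> new BufferingLineWriter(out, batchSize);
        try {
            BatchWriter<String> writer = factory.create(partFile);
            for (String record : records) {
                writer.addElement(record);
            }
            writer.finish();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(partFile.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Writing three records with a batch size of two yields one full batch followed by a smaller one that is forced out when the part file is finished, which is the "premature" write described above.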

Feel free to ask more specific questions. I will also cc Kostas who might give more implementation details.

Best,
Andrey

On Thu, Dec 27, 2018 at 12:24 PM Taher Koitawala <taher.koitawala@xxxxxxxxx> wrote:
Hi All,
         I am currently working on Flink 1.7 with StreamingFileSink and need to write Avro data to an S3 filesystem. The plan is to move from the old bucketing sink to the new sink, which is much more compatible.
          The application is also using checkpointing right now.
Can someone please explain how the row format and the bulk format work? The documentation only describes how they will be serialized.
 
Taher Koitawala
GS Lab Pune
+91 8407979163