
[jira] [Created] (ARROW-4027) Reading partitioned datasets using RecordBatchFileReader into R

Jeffrey Wong created ARROW-4027:

             Summary: Reading partitioned datasets using RecordBatchFileReader into R
                 Key: ARROW-4027
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 0.11.1
         Environment: Ubuntu 16.04, building R package from master on github
            Reporter: Jeffrey Wong

I have a parquet dataset (which originally came from Hive) stored locally in the directory `data/`. It contains 4 files.


Using pyarrow I can read them via


I am trying to read them into R using `read_table("data/foo1")`, but I receive this error:

 Error in ipc___RecordBatchFileReader__Open(file) : 
 Invalid: Not an Arrow file 

From debugging, I've traced it to this line, which then goes to this Rcpp code. It seems that this C++ function is expecting a single file-like object; I think because my data is split across files, the footer that is supposed to contain the file layout and schema cannot be found, hence the error "Not an Arrow file".


If I pass the whole directory using `read_table("data/")` I will get

Error in ipc___RecordBatchFileReader__Open(file) : 
 IOError: Error reading bytes from file: Is a directory 

I cannot post the original dataset online, and I don't know what aspect of my data causes the code to break, so I don't quite know how to post a reproducible example. Tips on how to generate a partitioned dataset would be great.

This message was sent by Atlassian JIRA