Monday, September 29, 2025

Oracle DB 23ai supports external Apache Iceberg tables

One area that I have been spending a lot of time on is accessing data stored in object storage from within Oracle DB 23ai using external table definitions.

Parquet objects

I started by creating external tables on Parquet objects, and recently I have been testing access to objects that are cataloged in Apache Iceberg.


Alexey Filanovskiy wrote a great blog post on Oracle external tables and Parquet objects, which you can find here.

Reading through that blog post gives you a good foundation for why Parquet-formatted objects make sense for external tables. Oracle DB is able to leverage the metadata embedded in Parquet objects to optimize queries.
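
To see the kind of metadata a query engine can take advantage of, you can dump a Parquet file's row-group statistics with pyarrow (a quick sketch; the file name is a placeholder):

import pyarrow.parquet as pq

# Open a Parquet file (placeholder name) without reading the data itself
pf = pq.ParquetFile("customer_address.parquet")

# File-level metadata: row count, number of row groups, schema
print(pf.metadata)

# Per-column min/max statistics in the first row group --
# the kind of information an engine can use to skip data
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.statistics)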

Defining external tables on parquet objects

Below is an example definition for an external table that accesses Parquet objects. In this example, the Parquet objects are stored on a ZFSSA and accessed through the OCI API.

CREATE TABLE "DW"."CUSTOMER_ADDRESS" 
   (    "CA_ADDRESS_SK" NUMBER(*,0), 
    "CA_ADDRESS_ID" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_STREET_NUMBER" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_STREET_NAME" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_STREET_TYPE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_SUITE_NUMBER" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_CITY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_COUNTY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_STATE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_ZIP" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_COUNTRY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_GMT_OFFSET" NUMBER, 
    "CA_LOCATION_TYPE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP"
   )  DEFAULT COLLATION "USING_NLS_COMP" 
   ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_BIGDATA
      DEFAULT DIRECTORY "DATA_PUMP_DIR"
      ACCESS PARAMETERS
      ( com.oracle.bigdata.credential.name=OCI_ZFS
        com.oracle.bigdata.fileformat=parquet
        com.oracle.bigdata.trimspaces=notrim
       )
      LOCATION
       ( 'oraclebmc://pmdemo-zs71-01-client.us.oracle.com/n/tcptest/b/tcpds_10000g_parquet/o/customer_address/*'
       )
    )
   REJECT LIMIT UNLIMITED 
  PARALLEL ;
What you will notice from this table definition:
  • It uses the "ORACLE_BIGDATA" access driver type. This is required to create external tables over big data file formats, which are currently [csv|textfile|avro|parquet|orc].
  • I am specifying a credential. I had already created a credential holding the OCI API access keys for my ZFSSA (a sketch of creating such a credential follows the notes below).
  • The object type for my objects is parquet.
  • The location of the object(s) is specified. In my case I am using an "*" as there are multiple objects within this path that make up the table.
  • The URL scheme is in this case "oraclebmc". The options are:
    • oraclebmc - specifies OCI API object storage.
    • s3 - specifies Amazon S3 object storage.
    • s3h - specifies a non-standard S3-compatible object store.
You can find additional information here.

Notes:

  1. The URL scheme is not well documented; I also found that in some cases "s3a" can be used for non-AWS S3 object storage.
  2. DBMS_CLOUD does not allow URLs other than "https", so you will most likely have to create these tables with the syntax above rather than with the DBMS_CLOUD package.
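
As referenced in the list above, the table definition needs a credential object. Below is a minimal sketch of one way to create it, driven from Python with the oracledb package. It assumes the DBMS_CLOUD package is installed in the 23ai database; the connection details and keys are placeholders, and for S3-style authentication the username is the access key ID and the password is the secret key:

import oracledb

# Placeholder connection details -- adjust for your environment
conn = oracledb.connect(user="dw", password="{db password}",
                        dsn="{db host}/{service name}")

with conn.cursor() as cur:
    # For S3-style authentication: username = access key ID, password = secret key
    cur.execute("""
        BEGIN
          DBMS_CLOUD.CREATE_CREDENTIAL(
            credential_name => 'OCI_ZFS',
            username        => '{my access key}',
            password        => '{my secret for that key}'
          );
        END;""")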

Apache Iceberg

Once I was able to successfully access Parquet objects directly, I began testing cataloging Parquet objects in Apache Iceberg using Spark.
I found the most difficult part was properly creating an Apache Iceberg metadata (manifest) file that could then be used to build an external table definition against Iceberg.

Environment to build manifest


My testing environment to build the Apache Iceberg manifest contains the following.

Python

  • Python 3.9.18
  • Below is a list of the Python packages installed:
Package             Version
------------------- -----------
aiobotocore         2.23.0
aiohappyeyeballs    2.6.1
aiohttp             3.12.14
aioitertools        0.12.0
aiosignal           1.4.0
annotated-types     0.7.0
async-timeout       5.0.1
attrs               20.3.0
Babel               2.9.1
bcc                 0.28.0
boto3               1.39.7
botocore            1.38.27
cachetools          5.5.2
certifi             2020.12.5
cffi                1.14.5
chardet             4.0.0
circuitbreaker      1.3.2
click               8.1.8
cloud-init          23.4
cockpit             311.1
colorama            0.4.6
configobj           5.0.6
configshell-fb      1.1.30
cryptography        36.0.1
dasbus              1.4
dbus-python         1.2.18
distro              1.5.0
docutils            0.16
file-magic          0.4.0
frozenlist          1.7.0
fsspec              2025.7.0
gpg                 1.15.1
greenlet            3.2.3
halo                0.0.31
idna                2.10
Jinja2              2.11.3
jmespath            1.0.1
jsonpatch           1.21
jsonpointer         2.0
jsonschema          3.2.0
kmod                0.1
Ksplice Uptrack     1.2.84
libcomps            0.1.18
lockfile            0.12.2
log-symbols         0.0.14
markdown-it-py      3.0.0
MarkupSafe          1.1.1
mdurl               0.1.2
mmh3                5.1.0
multidict           6.6.3
netaddr             0.8.0
netifaces           0.10.6
nftables            0.1
numpy               2.0.2
nvmetcli            0.7
oauthlib            3.1.1
oci                 2.126.3
oci-utils           0.14.0
oracledb            3.2.0
pandas              2.3.1
parquet_tools       0.2.16
pciutils            2.3.6
pcp                 5.0
perf                0.1
pexpect             4.8.0
pip                 25.1.1
ply                 3.11
prettytable         0.7.2
propcache           0.3.2
psutil              5.8.0
ptyprocess          0.6.0
py4j                0.10.9.9
pyarrow             20.0.0
pycparser           2.20
pycurl              7.43.0.6
pydantic            2.11.7
pydantic_core       2.33.2
Pygments            2.19.2
PyGObject           3.40.1
pyiceberg           0.9.1
pyOpenSSL           19.0.0
pyparsing           3.2.3
pyrsistent          0.17.3
pyserial            3.4
PySocks             1.7.1
pyspark             4.0.0
python-daemon       2.2.4
python-dateutil     2.9.0.post0
python-dmidecode    3.12.2
python-linux-procfs 0.7.3
pytz                2021.1
pyudev              0.22.0
PyYAML              5.4.1
requests            2.25.1
rhnlib              2.8.6
rich                13.9.4
rpm                 4.16.1.3
s3fs                2025.7.0
s3transfer          0.13.0
sdnotify            0.3.2
selinux             3.6
sepolicy            3.6
setools             4.4.4
setroubleshoot      3.3.32
setuptools          53.0.0
six                 1.15.0
sortedcontainers    2.4.0
sos                 4.7.0
spinners            0.0.24
SQLAlchemy          2.0.41
strictyaml          1.7.3
systemd-python      234
tabulate            0.9.0
tenacity            9.1.2
termcolor           3.1.0
thrift              0.16.0
typing_extensions   4.14.0
typing-inspection   0.4.1
tzdata              2025.2
urllib3             1.26.5
urwid               2.1.2
wrapt               1.17.2
yarl                1.20.1

PySpark & Spark runtime

PySpark version: 4.0.0
Java version: 21.0.7
Scala version: 2.13.16

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Using Scala version 2.13.16, OpenJDK 64-Bit Server VM, 21.0.7
Branch HEAD
Compiled by user wenchen on 2025-05-19T07:58:03Z
Revision fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4
Url https://github.com/apache/spark

Environment variables

HADOOP_OPTS= -Djava.library.path=/usr/local/hadoop/hadoop-3.4.1/lib/
JAVA_HOME=/usr/lib/jvm/java-21-openjdk-21.0.7.0.6-1.0.1.el9.x86_64
HADOOP_HOME=/usr/local/hadoop/hadoop-3.4.1
SPARK_HOME=/usr/local/spark
HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/hadoop-3.4.1/lib/native/


Script to build the manifest and store the Parquet objects

I used the New York City (TLC) taxi trip data to build my objects, and it can be found here.
For my testing, I used the yellow taxi datasets.
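
For example, a single month of the yellow taxi data can be pulled down with a few lines of Python. The URL below follows the pattern on the TLC download page at the time of writing; verify the current links on the site:

import urllib.request

# One month of yellow taxi trip records (check the TLC site for current URLs)
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"
urllib.request.urlretrieve(url, "yellow_tripdata_2024-01.parquet")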

Below is the section of my code that configures the Spark catalog on OCI S3-compatible object storage and sets up the environment using my credentials.

NOTE: It points to my endpoint and uses my access key and secret. I removed those from the script.

Python script to set the environment and Hadoop catalog (Part #1)

from pyspark.sql import SparkSession

ICEBERG = "org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0"
HADOOP_CLOUD = "org.apache.spark:spark-hadoop-cloud_2.13:4.0.0"

spark = (
    SparkSession.builder
      .appName("iceberg-oci-s3")
      .config("spark.jars.packages", ",".join([ICEBERG, HADOOP_CLOUD]))
      # Iceberg catalog -> Hadoop catalog in an OCI S3 warehouse
      .config("spark.sql.catalog.oci", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.oci.type", "hadoop")
      .config("spark.sql.catalog.oci.warehouse", "s3a://icebergs3/iceberg_warehouse")
      # OCI S3-compatible endpoint
      .config("spark.hadoop.fs.s3a.endpoint", "https://{my tenancy/namespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.endpoint.region", "us-ashburn-1")
      .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
      .config("spark.hadoop.fs.s3a.list.version", "1")
      .getOrCreate()
)


hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key",    "{my access key}")            # OCI 'Customer Secret Key' ID
hconf.set("fs.s3a.secret.key",    "{my secret for that key}")   # the secret key
hconf.set("fs.s3a.connection.ssl.enabled", "true")

Python script to create namespace and create table (Part #2)

spark.sql("""
  CREATE NAMESPACE IF NOT EXISTS oci_db
""")

import pyarrow.parquet as pq


spark.sql("""
  CREATE NAMESPACE IF NOT EXISTS oci.tripdata
""")

spark.sql("""
  CREATE TABLE IF NOT EXISTS oci.tripdata.yellow (
  `VendorID` INT,
  `tpep_pickup_datetime` TIMESTAMP,
  `tpep_dropoff_datetime` TIMESTAMP,
  `passenger_count` BIGINT,
  `trip_distance` DOUBLE,
  `RatecodeID` BIGINT,
  `store_and_fwd_flag` STRING,
  `PULocationID` INT,
  `DOLocationID` INT,
  `payment_type` BIGINT,
  `fare_amount` DOUBLE,
  `extra` DOUBLE,
  `mta_tax` DOUBLE,
  `tip_amount` DOUBLE,
  `tolls_amount` DOUBLE,
  `improvement_surcharge` DOUBLE,
  `total_amount` DOUBLE,
  `congestion_surcharge` DOUBLE,
  `Airport_fee` DOUBLE
  )
  USING iceberg
  PARTITIONED BY  (months(tpep_pickup_datetime))
""")


Python script to read the Parquet object, cast column types, and append to the Iceberg table (Part #3)

sdf = spark.read.parquet("{parquet object}")  # Location and name of parquet object to load

from pyspark.sql import functions as F
sdf_cast = (
    sdf
    .withColumn("VendorID", F.col("VendorID").cast("int"))
    .withColumn("tpep_pickup_datetime", F.col("tpep_pickup_datetime").cast("timestamp"))
    .withColumn("tpep_dropoff_datetime", F.col("tpep_dropoff_datetime").cast("timestamp"))
    .withColumn("passenger_count", F.col("passenger_count").cast("bigint"))
    .withColumn("trip_distance", F.col("trip_distance").cast("double"))
    .withColumn("RatecodeID", F.col("RatecodeID").cast("bigint"))
    .withColumn("store_and_fwd_flag", F.col("store_and_fwd_flag").cast("string"))
    .withColumn("PULocationID", F.col("PULocationID").cast("int"))
    .withColumn("DOLocationID", F.col("DOLocationID").cast("int"))
    .withColumn("payment_type", F.col("payment_type").cast("bigint"))
    .withColumn("fare_amount", F.col("fare_amount").cast("double"))
    .withColumn("extra", F.col("extra").cast("double"))
    .withColumn("mta_tax", F.col("mta_tax").cast("double"))
    .withColumn("tip_amount", F.col("tip_amount").cast("double"))
    .withColumn("tolls_amount", F.col("tolls_amount").cast("double"))
    .withColumn("improvement_surcharge", F.col("improvement_surcharge").cast("double"))
    .withColumn("total_amount", F.col("total_amount").cast("double"))
    .withColumn("congestion_surcharge", F.col("congestion_surcharge").cast("double"))
    .withColumn("Airport_fee", F.col("Airport_fee").cast("double"))
)

sdf_cast.writeTo("oci.tripdata.yellow").append()
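
After each append, you can verify that Iceberg recorded a new snapshot by querying the table's snapshots metadata table (standard Iceberg syntax in Spark):

# Each append should appear as a new snapshot with operation = 'append'
spark.sql("""
  SELECT snapshot_id, committed_at, operation
  FROM oci.tripdata.yellow.snapshots
  ORDER BY committed_at
""").show(truncate=False)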

Summary:

Combining the three parts of the script, filling in the credentials, endpoint, and bucket, and specifying the Parquet object to load will populate object storage properly, including the Iceberg metadata (manifest) files.

Investigating the resulting manifest and objects

Below is a list of the objects created in the metadata directory.


You can see that I updated and appended data several times, and each time a new version of the metadata file was created along with snapshot information.

When writing the data, I chose to partition it; the partitioned data is shown below. You can see that when Apache Iceberg wrote the data, it automatically created Parquet objects in a directory for each partition, and appended new data files into the existing partition directories.
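
You can also inspect the metadata file directly with pyiceberg (included in the package list above). Below is a sketch that loads the table state from a specific metadata object; the endpoint and keys are placeholders:

from pyiceberg.table import StaticTable

# Load the table state straight from a metadata file (placeholder endpoint/keys)
tbl = StaticTable.from_metadata(
    "s3://icebergs3/iceberg_warehouse/tripdata/yellow/metadata/v7.metadata.json",
    properties={
        "s3.endpoint": "https://{namespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com",
        "s3.access-key-id": "{my access key}",
        "s3.secret-access-key": "{my secret for that key}",
        "s3.region": "us-ashburn-1",
    },
)

print(tbl.schema())              # column names and Iceberg types
for snap in tbl.snapshots():     # one entry per commit/append
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary)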


Creating an Oracle table on top of Iceberg

Now that I have created my Iceberg table and stored it in object storage, I can create an Oracle external table that reads the metadata (manifest) file.

In my example, the most recent manifest object is "iceberg_warehouse/tripdata/yellow/metadata/v7.metadata.json".

My external table definition is below.

CREATE TABLE CUSTOMERS_ICEBERG
(
  VendorID               integer,
  tpep_pickup_datetime   timestamp,
  tpep_dropoff_datetime  timestamp,
  passenger_count        number,
  trip_distance          number,
  RatecodeID             number,
  store_and_fwd_flag     blob,
  PULocationID           number,
  DOLocationID           number,
  payment_type           number,
  fare_amount            number,
  extra                  number,
  mta_tax                number,
  tip_amount             number,
  tolls_amount           number,
  improvement_surcharge  number,
  total_amount           number,
  congestion_surcharge   number,
  Airport_fee            number
)
ORGANIZATION EXTERNAL
(
    TYPE ORACLE_BIGDATA
    DEFAULT DIRECTORY DATA_PUMP_DIR
    ACCESS PARAMETERS
    (
        com.oracle.bigdata.fileformat=parquet
        com.oracle.bigdata.credential.name=OCI_S3
        com.oracle.bigdata.access_protocol=iceberg
    )
   LOCATION ('iceberg:https://{namespace}.compat.objectstorage.us-ashburn-1.oci.customer-oci.com/icebergs3/iceberg_warehouse/tripdata/yellow/metadata/v7.metadata.json')
)
PARALLEL;

NOTE: the URL scheme in my LOCATION is "iceberg:https".

I can now select from my new table, and Oracle will read the Iceberg metadata file to locate and read the underlying Parquet data.
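
For example, here is a quick check from Python using the oracledb package (connection details are placeholders):

import oracledb

conn = oracledb.connect(user="dw", password="{db password}",
                        dsn="{db host}/{service name}")

with conn.cursor() as cur:
    # The row count is computed from the Parquet data files listed in the Iceberg metadata
    cur.execute("SELECT COUNT(*) FROM customers_iceberg")
    print(cur.fetchone()[0])

    # A few sample rows
    cur.execute("""
        SELECT vendorid, tpep_pickup_datetime, total_amount
        FROM customers_iceberg
        FETCH FIRST 5 ROWS ONLY""")
    for row in cur.fetchall():
        print(row)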

Conclusion:

Oracle Database 23ai not only supports creating external tables on top of Parquet objects, but it also supports creating external tables on top of Apache Iceberg manifest objects.




