One area that I have been spending a lot of time on is accessing data stored in object storage from within Oracle DB 23ai using external table definitions.
Parquet objects
I started by creating external tables on Parquet objects, and more recently I have been testing access to objects that are cataloged in Apache Iceberg.
Alexey Filanovskiy wrote a great blog post on Oracle external tables and Parquet objects, which you can find here.
Reading through that blog post gives you a good foundation for why Parquet-formatted objects make sense for external tables: Oracle DB is able to leverage the metadata embedded in Parquet objects to optimize queries.
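To see the kind of metadata involved, here is a small sketch using pyarrow (which appears in my package list below); the file name is a placeholder for a local copy of one object. The per-column min/max statistics in each row group are the kind of metadata a reader can use to skip data during a filtered scan.
import pyarrow.parquet as pq

# "{local parquet file}" is a placeholder for a local copy of one object
md = pq.ParquetFile("{local parquet file}").metadata
print(md.num_rows, md.num_row_groups)

# Per-column min/max statistics in the first row group -- the kind of
# metadata a reader can use to prune row groups during a filtered scan
for i in range(md.num_columns):
    col = md.row_group(0).column(i)
    print(col.path_in_schema, col.statistics)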
Defining external tables on parquet objects
Below is an example definition for an external table that accesses Parquet objects. In this example, the Parquet objects are stored on a ZFSSA and accessed through the OCI API.
CREATE TABLE "DW"."CUSTOMER_ADDRESS"
( "CA_ADDRESS_SK" NUMBER(*,0),
"CA_ADDRESS_ID" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_STREET_NUMBER" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_STREET_NAME" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_STREET_TYPE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_SUITE_NUMBER" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_CITY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_COUNTY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_STATE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_ZIP" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_COUNTRY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP",
"CA_GMT_OFFSET" NUMBER,
"CA_LOCATION_TYPE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP"
) DEFAULT COLLATION "USING_NLS_COMP"
ORGANIZATION EXTERNAL
( TYPE ORACLE_BIGDATA
DEFAULT DIRECTORY "DATA_PUMP_DIR"
ACCESS PARAMETERS
( com.oracle.bigdata.credential.name=OCI_ZFS
com.oracle.bigdata.fileformat=parquet
com.oracle.bigdata.trimspaces=notrim
)
LOCATION
( 'oraclebmc://pmdemo-zs71-01-client.us.oracle.com/n/tcptest/b/tcpds_10000g_parquet/o/customer_address/*'
)
)
REJECT LIMIT UNLIMITED
PARALLEL ;
What you will notice from this table definition:
- It uses the "ORACLE_BIGDATA" type. This is necessary to create external tables on big data object types; the supported formats are currently [csv|textfile|avro|parquet|orc].
- I am specifying a credential. I had already created a credential that contains the login credentials for the OCI API on my ZFSSA (a sketch of creating one appears after the notes below).
- The file format for my objects is parquet.
- The location of the object(s) is specified. In my case I am using an "*", as there are multiple objects within this path that make up the table.
- The URL scheme, which in this case is "ORACLEBMC". The options are:
- ORACLEBMC - Specifies OCI object storage accessed through the OCI API.
- S3 - Specifies Amazon S3 object storage.
- S3h - Specifies a non-standard S3 object store.
Notes:
- The format of the URL is not well documented; I also found that in some cases "s3a" can be used for non-AWS S3 object storage.
- DBMS_CLOUD does not allow URL schemes other than "https", so you will most likely have to create tables using the above syntax rather than the DBMS_CLOUD package.
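For reference, the credential itself can be created with DBMS_CLOUD.CREATE_CREDENTIAL. Below is a minimal sketch using python-oracledb (also in my package list), assuming DBMS_CLOUD is installed in the 23ai database; the connection details and ZFSSA API user/password values are placeholders.
import oracledb

# Connection details are placeholders
conn = oracledb.connect(user="dw", password="{password}", dsn="{db23ai alias}")
with conn.cursor() as cur:
    # Store the ZFSSA OCI API login under the name used in the DDL above
    cur.execute("""
        BEGIN
          DBMS_CLOUD.CREATE_CREDENTIAL(
            credential_name => 'OCI_ZFS',
            username        => '{zfssa api user}',     -- placeholder
            password        => '{zfssa api password}'  -- placeholder
          );
        END;""")
conn.commit()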
Apache Iceberg
Environment to build manifest
Python
- Python 3.9.18
- Below is a list of the Python packages I have installed:
Package Version
------------------- -----------
aiobotocore 2.23.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.14
aioitertools 0.12.0
aiosignal 1.4.0
annotated-types 0.7.0
async-timeout 5.0.1
attrs 20.3.0
Babel 2.9.1
bcc 0.28.0
boto3 1.39.7
botocore 1.38.27
cachetools 5.5.2
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
circuitbreaker 1.3.2
click 8.1.8
cloud-init 23.4
cockpit 311.1
colorama 0.4.6
configobj 5.0.6
configshell-fb 1.1.30
cryptography 36.0.1
dasbus 1.4
dbus-python 1.2.18
distro 1.5.0
docutils 0.16
file-magic 0.4.0
frozenlist 1.7.0
fsspec 2025.7.0
gpg 1.15.1
greenlet 3.2.3
halo 0.0.31
idna 2.10
Jinja2 2.11.3
jmespath 1.0.1
jsonpatch 1.21
jsonpointer 2.0
jsonschema 3.2.0
kmod 0.1
Ksplice Uptrack 1.2.84
libcomps 0.1.18
lockfile 0.12.2
log-symbols 0.0.14
markdown-it-py 3.0.0
MarkupSafe 1.1.1
mdurl 0.1.2
mmh3 5.1.0
multidict 6.6.3
netaddr 0.8.0
netifaces 0.10.6
nftables 0.1
numpy 2.0.2
nvmetcli 0.7
oauthlib 3.1.1
oci 2.126.3
oci-utils 0.14.0
oracledb 3.2.0
pandas 2.3.1
parquet_tools 0.2.16
pciutils 2.3.6
pcp 5.0
perf 0.1
pexpect 4.8.0
pip 25.1.1
ply 3.11
prettytable 0.7.2
propcache 0.3.2
psutil 5.8.0
ptyprocess 0.6.0
py4j 0.10.9.9
pyarrow 20.0.0
pycparser 2.20
pycurl 7.43.0.6
pydantic 2.11.7
pydantic_core 2.33.2
Pygments 2.19.2
PyGObject 3.40.1
pyiceberg 0.9.1
pyOpenSSL 19.0.0
pyparsing 3.2.3
pyrsistent 0.17.3
pyserial 3.4
PySocks 1.7.1
pyspark 4.0.0
python-daemon 2.2.4
python-dateutil 2.9.0.post0
python-dmidecode 3.12.2
python-linux-procfs 0.7.3
pytz 2021.1
pyudev 0.22.0
PyYAML 5.4.1
requests 2.25.1
rhnlib 2.8.6
rich 13.9.4
rpm 4.16.1.3
s3fs 2025.7.0
s3transfer 0.13.0
sdnotify 0.3.2
selinux 3.6
sepolicy 3.6
setools 4.4.4
setroubleshoot 3.3.32
setuptools 53.0.0
six 1.15.0
sortedcontainers 2.4.0
sos 4.7.0
spinners 0.0.24
SQLAlchemy 2.0.41
strictyaml 1.7.3
systemd-python 234
tabulate 0.9.0
tenacity 9.1.2
termcolor 3.1.0
thrift 0.16.0
typing_extensions 4.14.0
typing-inspection 0.4.1
tzdata 2025.2
urllib3 1.26.5
urwid 2.1.2
wrapt 1.17.2
yarl 1.20.1
PySpark & Spark runtime
PySpark version: 4.0.0
Java version: 21.0.7
Scala version: 2.13.16
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 4.0.0
/_/
Using Scala version 2.13.16, OpenJDK 64-Bit Server VM, 21.0.7
Branch HEAD
Compiled by user wenchen on 2025-05-19T07:58:03Z
Revision fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4
Url https://github.com/apache/spark
Environment variables
HADOOP_OPTS= -Djava.library.path=/usr/local/hadoop/hadoop-3.4.1/lib/
JAVA_HOME=/usr/lib/jvm/java-21-openjdk-21.0.7.0.6-1.0.1.el9.x86_64
HADOOP_HOME=/usr/local/hadoop/hadoop-3.4.1
SPARK_HOME=/usr/local/spark
HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/hadoop-3.4.1/lib/native/
Python scripts to build the manifest and store the objects
Python script to set the environment and Hadoop catalog (Part #1)
from pyspark.sql import SparkSession
ICEBERG = "org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0"
HADOOP_CLOUD = "org.apache.spark:spark-hadoop-cloud_2.13:4.0.0"
spark = (
SparkSession.builder
.appName("iceberg-oci-s3")
.config("spark.jars.packages", ",".join([ICEBERG, HADOOP_CLOUD]))
# Iceberg catalog -> Hadoop catalog in an OCI S3 warehouse
.config("spark.sql.catalog.oci", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.oci.type", "hadoop")
.config("spark.sql.catalog.oci.warehouse", "s3a://icebergs3/iceberg_warehouse")
# OCI S3-compatible endpoint
.config("spark.hadoop.fs.s3a.endpoint", "https://{my tenancy/namespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com")
.config("spark.hadoop.fs.s3a.path.style.access", "true")
.config("spark.hadoop.fs.s3a.endpoint.region", "us-ashburn-1")
.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
.config("spark.hadoop.fs.s3a.list.version", "1")
.getOrCreate()
)
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "{my access key}") # OCI 'Customer Secret Key' ID
hconf.set("fs.s3a.secret.key", "{my secret for that key}") # the secret key
hconf.set("fs.s3a.connection.ssl.enabled", "true")
Python script to create the namespace and table (Part #2)
# Create the namespace for the table in the "oci" Iceberg catalog
spark.sql("""
CREATE NAMESPACE IF NOT EXISTS oci.tripdata
""")
spark.sql("""
CREATE TABLE IF NOT EXISTS oci.tripdata.yellow (
`VendorID` INT,
`tpep_pickup_datetime` TIMESTAMP,
`tpep_dropoff_datetime` TIMESTAMP,
`passenger_count` BIGINT,
`trip_distance` DOUBLE,
`RatecodeID` BIGINT,
`store_and_fwd_flag` STRING,
`PULocationID` INT,
`DOLocationID` INT,
`payment_type` BIGINT,
`fare_amount` DOUBLE,
`extra` DOUBLE,
`mta_tax` DOUBLE,
`tip_amount` DOUBLE,
`tolls_amount` DOUBLE,
`improvement_surcharge` DOUBLE,
`total_amount` DOUBLE,
`congestion_surcharge` DOUBLE,
`Airport_fee` DOUBLE
)
USING iceberg
PARTITIONED BY (months(tpep_pickup_datetime))
""")
Python script to read the parquet object, set types, and append to the Iceberg table (Part #3)
sdf = spark.read.parquet("{parquet object}") # Location and name of parquet object to load
from pyspark.sql import functions as F
sdf_cast = (
sdf
.withColumn("VendorID", F.col("VendorID").cast("int"))
.withColumn("tpep_pickup_datetime", F.col("tpep_pickup_datetime").cast("timestamp"))
.withColumn("tpep_dropoff_datetime", F.col("tpep_dropoff_datetime").cast("timestamp"))
.withColumn("passenger_count", F.col("passenger_count").cast("bigint"))
.withColumn("trip_distance", F.col("trip_distance").cast("double"))
.withColumn("RatecodeID", F.col("RatecodeID").cast("bigint"))
.withColumn("store_and_fwd_flag", F.col("store_and_fwd_flag").cast("string"))
.withColumn("PULocationID", F.col("PULocationID").cast("int"))
.withColumn("DOLocationID", F.col("DOLocationID").cast("int"))
.withColumn("payment_type", F.col("payment_type").cast("bigint"))
.withColumn("fare_amount", F.col("fare_amount").cast("double"))
.withColumn("extra", F.col("extra").cast("double"))
.withColumn("mta_tax", F.col("mta_tax").cast("double"))
.withColumn("tip_amount", F.col("tip_amount").cast("double"))
.withColumn("tolls_amount", F.col("tolls_amount").cast("double"))
.withColumn("improvement_surcharge", F.col("improvement_surcharge").cast("double"))
.withColumn("total_amount", F.col("total_amount").cast("double"))
.withColumn("congestion_surcharge", F.col("congestion_surcharge").cast("double"))
.withColumn("Airport_fee", F.col("Airport_fee").cast("double"))
)
sdf_cast.writeTo("oci.tripdata.yellow").append()
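A quick read-back through the catalog confirms the append landed:
# Count rows through the Iceberg table rather than the raw objects
print(spark.table("oci.tripdata.yellow").count())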
Summary:
Investigating the resulting manifest and objects
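One way to investigate what was written, sketched here against the Part #1 Spark session, is to query Iceberg's metadata tables for the snapshots and data files behind the table:
# Snapshots recorded for each append to the table
spark.sql("SELECT snapshot_id, committed_at, operation FROM oci.tripdata.yellow.snapshots").show(truncate=False)

# Data files the current snapshot points at
spark.sql("SELECT file_path, record_count FROM oci.tripdata.yellow.files").show(truncate=False)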
Creating an Oracle table on top of Iceberg
CREATE TABLE CUSTOMERS_ICEBERG
(
VendorID integer,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp,
passenger_count number,
trip_distance number,
RatecodeID number,
store_and_fwd_flag blob,
PULocationID number,
DOLocationID number,
payment_type number,
fare_amount number,
extra number,
mta_tax number,
tip_amount number,
tolls_amount number,
improvement_surcharge number,
total_amount number,
congestion_surcharge number,
Airport_fee number
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_BIGDATA
DEFAULT DIRECTORY DATA_PUMP_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.fileformat=parquet
com.oracle.bigdata.credential.name=OCI_S3
com.oracle.bigdata.access_protocol=iceberg
)
LOCATION ('iceberg:https://{namespace}.compat.objectstorage.us-ashburn-1.oci.customer-oci.com/icebergs3/iceberg_warehouse/tripdata/yellow/metadata/v7.metadata.json')
)
PARALLEL;
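To sanity-check the external table from the database side, here is a minimal sketch with python-oracledb (the connection details are placeholders):
import oracledb

conn = oracledb.connect(user="dw", password="{password}", dsn="{db23ai alias}")
with conn.cursor() as cur:
    # Oracle reads v7.metadata.json, follows the current snapshot's
    # manifests, and scans the underlying parquet objects
    cur.execute("SELECT COUNT(*) FROM CUSTOMERS_ICEBERG")
    print(cur.fetchone()[0])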