Monday, September 29, 2025

Oracle DB 23ai supports external Apache Iceberg tables

 One area that I have been spending a lot of time on is accessing data stored in object storage from within Oracle DB 23ai using external table definitions.

Parquet objects

I started by creating external tables on Parquet objects, and recently I have been testing access to objects that are cataloged in Apache Iceberg.


Alexey Filanovskiy wrote a great blog post on Oracle tables and parquet objects, which you can find here.

Reading through that blog post gives you a good foundation for why parquet-formatted objects make sense for external tables. Oracle DB is able to leverage the metadata in parquet objects to optimize queries.
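
To see the kind of metadata Oracle can take advantage of, you can inspect a parquet object directly. Below is a small sketch using pyarrow (the file name is a placeholder for a local copy of one of the objects); it prints the per-row-group, per-column min/max statistics that allow an engine to skip row groups that cannot match a predicate.

import pyarrow.parquet as pq

# Placeholder: a local copy of one of the customer_address parquet objects
pf = pq.ParquetFile("customer_address_part_0001.parquet")

meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)   # total rows and row groups in the object

# Each row group stores per-column statistics (min/max, null count);
# these are the values a query engine can use to prune I/O.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    if col.statistics is not None and col.statistics.has_min_max:
        print(col.path_in_schema, col.statistics.min, col.statistics.max, col.statistics.null_count)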

Defining external tables on parquet objects

Below is an example table definition for an external table accessing parquet objects. In this example, the parquet objects are stored on a ZFS Storage Appliance (ZFSSA) and accessed through its OCI API.

CREATE TABLE "DW"."CUSTOMER_ADDRESS" 
   (    "CA_ADDRESS_SK" NUMBER(*,0), 
    "CA_ADDRESS_ID" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_STREET_NUMBER" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_STREET_NAME" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_STREET_TYPE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_SUITE_NUMBER" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_CITY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_COUNTY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_STATE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_ZIP" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_COUNTRY" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
    "CA_GMT_OFFSET" NUMBER, 
    "CA_LOCATION_TYPE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP"
   )  DEFAULT COLLATION "USING_NLS_COMP" 
   ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_BIGDATA
      DEFAULT DIRECTORY "DATA_PUMP_DIR"
      ACCESS PARAMETERS
      ( com.oracle.bigdata.credential.name=OCI_ZFS
        com.oracle.bigdata.fileformat=parquet
          com.oracle.bigdata.trimspaces=notrim
       )
      LOCATION
       ( 'oraclebmc://pmdemo-zs71-01-client.us.oracle.com/n/tcptest/b/tcpds_10000g_parquet/o/customer_address/*'
       )
    )
   REJECT LIMIT UNLIMITED 
  PARALLEL ;
What you will notice from this table definition:
  • It uses the "ORACLE_BIGDATA" type. This is necessary to create external tables on big data object formats, which are currently [csv|textfile|avro|parquet|orc].
  • I am specifying a credential. I had already created a credential that contains the login credentials for the OCI API on my ZFSSA (a sketch of creating such a credential appears after the notes below).
  • The object type for my objects is parquet.
  • The location of the object(s) is specified. In my case I am using an "*" as there are multiple objects within this path that make up the table.
  • The format of the URL is, in this case, "ORACLEBMC". The options are:
    • ORACLEBMC - Specifies OCI API object storage.
    • S3 - Specifies Amazon S3 object storage.
    • S3h - Specifies a non-standard S3 object store.
You can find additional information here.

Notes:

  1. The format of the URL is not well documented; I also found that in some cases "s3a" can be used for non-AWS S3 object storage.
  2. DBMS_CLOUD does not allow you to use URLs other than "https", so you will most likely have to create these tables using the above syntax rather than the DBMS_CLOUD package.
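
For reference, below is a minimal sketch of how a credential such as OCI_ZFS could be created. Even though DBMS_CLOUD cannot create these tables (see note 2 above), the credential the table references is typically created with DBMS_CLOUD.CREATE_CREDENTIAL (or DBMS_CREDENTIAL.CREATE_CREDENTIAL). The connection details and keys below are placeholders, and the username/password form shown assumes S3-style access key/secret authentication; an OCI-native (request-signing) setup would instead use the overload that takes user_ocid, tenancy_ocid, private_key, and fingerprint.

import oracledb

# Placeholder connection details; connect as the schema that will own the external table
conn = oracledb.connect(user="dw", password="{db password}", dsn="{db host}/{service name}")

with conn.cursor() as cur:
    cur.execute("""
        BEGIN
          DBMS_CLOUD.CREATE_CREDENTIAL(
            credential_name => 'OCI_ZFS',
            username        => :access_key,   -- access key for the object store's S3-style API
            password        => :secret_key    -- the matching secret key
          );
        END;""",
        access_key="{my access key}",
        secret_key="{my secret for that key}")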

Apache Iceberg

Once I was able to successfully access parquet objects directly, I began cataloging parquet objects with Apache Iceberg using Spark.
I found that the most difficult part was properly creating an Apache Iceberg manifest file that could be used to build an external table definition against Iceberg.

Environment to build manifest


My testing environment to build the Apache Iceberg manifest contains the following.

Python

  • Python 3.9.18
  • Below is a list of the Python packages that are installed:
Python 3.9.18
Package             Version
------------------- -----------
aiobotocore         2.23.0
aiohappyeyeballs    2.6.1
aiohttp             3.12.14
aioitertools        0.12.0
aiosignal           1.4.0
annotated-types     0.7.0
async-timeout       5.0.1
attrs               20.3.0
Babel               2.9.1
bcc                 0.28.0
boto3               1.39.7
botocore            1.38.27
cachetools          5.5.2
certifi             2020.12.5
cffi                1.14.5
chardet             4.0.0
circuitbreaker      1.3.2
click               8.1.8
cloud-init          23.4
cockpit             311.1
colorama            0.4.6
configobj           5.0.6
configshell-fb      1.1.30
cryptography        36.0.1
dasbus              1.4
dbus-python         1.2.18
distro              1.5.0
docutils            0.16
file-magic          0.4.0
frozenlist          1.7.0
fsspec              2025.7.0
gpg                 1.15.1
greenlet            3.2.3
halo                0.0.31
idna                2.10
Jinja2              2.11.3
jmespath            1.0.1
jsonpatch           1.21
jsonpointer         2.0
jsonschema          3.2.0
kmod                0.1
Ksplice Uptrack     1.2.84
libcomps            0.1.18
lockfile            0.12.2
log-symbols         0.0.14
markdown-it-py      3.0.0
MarkupSafe          1.1.1
mdurl               0.1.2
mmh3                5.1.0
multidict           6.6.3
netaddr             0.8.0
netifaces           0.10.6
nftables            0.1
numpy               2.0.2
nvmetcli            0.7
oauthlib            3.1.1
oci                 2.126.3
oci-utils           0.14.0
oracledb            3.2.0
pandas              2.3.1
parquet_tools       0.2.16
pciutils            2.3.6
pcp                 5.0
perf                0.1
pexpect             4.8.0
pip                 25.1.1
ply                 3.11
prettytable         0.7.2
propcache           0.3.2
psutil              5.8.0
ptyprocess          0.6.0
py4j                0.10.9.9
pyarrow             20.0.0
pycparser           2.20
pycurl              7.43.0.6
pydantic            2.11.7
pydantic_core       2.33.2
Pygments            2.19.2
PyGObject           3.40.1
pyiceberg           0.9.1
pyOpenSSL           19.0.0
pyparsing           3.2.3
pyrsistent          0.17.3
pyserial            3.4
PySocks             1.7.1
pyspark             4.0.0
python-daemon       2.2.4
python-dateutil     2.9.0.post0
python-dmidecode    3.12.2
python-linux-procfs 0.7.3
pytz                2021.1
pyudev              0.22.0
PyYAML              5.4.1
requests            2.25.1
rhnlib              2.8.6
rich                13.9.4
rpm                 4.16.1.3
s3fs                2025.7.0
s3transfer          0.13.0
sdnotify            0.3.2
selinux             3.6
sepolicy            3.6
setools             4.4.4
setroubleshoot      3.3.32
setuptools          53.0.0
six                 1.15.0
sortedcontainers    2.4.0
sos                 4.7.0
spinners            0.0.24
SQLAlchemy          2.0.41
strictyaml          1.7.3
systemd-python      234
tabulate            0.9.0
tenacity            9.1.2
termcolor           3.1.0
thrift              0.16.0
typing_extensions   4.14.0
typing-inspection   0.4.1
tzdata              2025.2
urllib3             1.26.5
urwid               2.1.2
wrapt               1.17.2
yarl                1.20.1

PySpark & Spark runtime

PySpark version: 4.0.0
Java version: 21.0.7
Scala version: 2.13.16

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Using Scala version 2.13.16, OpenJDK 64-Bit Server VM, 21.0.7
Branch HEAD
Compiled by user wenchen on 2025-05-19T07:58:03Z
Revision fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4
Url https://github.com/apache/spark

 Environment variables

HADOOP_OPTS= -Djava.library.path=/usr/local/hadoop/hadoop-3.4.1/lib/ -Djava.library.path=/usr/local/hadoop/hadoop-3.4.1/lib/ -Djava.library.path=/usr/local/hadoop/hadoop-3.4.1/lib/
JAVA_HOME=/usr/lib/jvm/java-21-openjdk-21.0.7.0.6-1.0.1.el9.x86_64
HADOOP_HOME=/usr/local/hadoop/hadoop-3.4.1
SPARK_HOME=/usr/local/spark
HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/hadoop-3.4.1/lib/native/


Script to build the manifest and store the data files

I used the NY State Taxi data to build my objects, and it can be found here.
For my testing, I used the yellow taxi datasets.

Below is the section of my code that builds the Spark catalog on OCI S3 object storage and sets the environment using my credentials.

NOTE: It points to my endpoint and uses my access key and secret. I removed those from the script.

Python script to set the environment and hadoop catalog (Part #1)

from pyspark.sql import SparkSession

ICEBERG = "org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0"
HADOOP_CLOUD = "org.apache.spark:spark-hadoop-cloud_2.13:4.0.0"

spark = (
    SparkSession.builder
      .appName("iceberg-oci-s3")
      .config("spark.jars.packages", ",".join([ICEBERG, HADOOP_CLOUD]))
      # Iceberg catalog -> Hadoop catalog in an OCI S3 warehouse
      .config("spark.sql.catalog.oci", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.oci.type", "hadoop")
      .config("spark.sql.catalog.oci.warehouse", "s3a://icebergs3/iceberg_warehouse")
      # OCI S3-compatible endpoint
      .config("spark.hadoop.fs.s3a.endpoint", "https://{my tenancy/namespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.endpoint.region", "us-ashburn-1")
      .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
      .config("spark.hadoop.fs.s3a.list.version", "1")
      .getOrCreate()
)


hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key",    "{my access key}")            # OCI 'Customer Secret Key' ID
hconf.set("fs.s3a.secret.key",    "{my secret for that key}")   # the secret key
hconf.set("fs.s3a.connection.ssl.enabled", "true")

Python script to create namespace and create table (Part #2)

# Create the namespace for the table inside the "oci" Iceberg catalog defined above
spark.sql("""
  CREATE NAMESPACE IF NOT EXISTS oci.tripdata
""")

spark.sql("""
  CREATE TABLE IF NOT EXISTS oci.tripdata.yellow (
  `VendorID` INT,
  `tpep_pickup_datetime` TIMESTAMP,
  `tpep_dropoff_datetime` TIMESTAMP,
  `passenger_count` BIGINT,
  `trip_distance` DOUBLE,
  `RatecodeID` BIGINT,
  `store_and_fwd_flag` STRING,
  `PULocationID` INT,
  `DOLocationID` INT,
  `payment_type` BIGINT,
  `fare_amount` DOUBLE,
  `extra` DOUBLE,
  `mta_tax` DOUBLE,
  `tip_amount` DOUBLE,
  `tolls_amount` DOUBLE,
  `improvement_surcharge` DOUBLE,
  `total_amount` DOUBLE,
  `congestion_surcharge` DOUBLE,
  `Airport_fee` DOUBLE
  )
  USING iceberg
  PARTITIONED BY  (months(tpep_pickup_datetime))
""")


Python script to read the parquet object, cast the types, and append to the Iceberg table (Part #3)

sdf = spark.read.parquet("{parquet object}")  # Location and name of parquet object to load

from pyspark.sql import functions as F
sdf_cast = (
    sdf
    .withColumn("VendorID", F.col("VendorID").cast("int"))
    .withColumn("tpep_pickup_datetime", F.col("tpep_pickup_datetime").cast("timestamp"))
    .withColumn("tpep_dropoff_datetime", F.col("tpep_dropoff_datetime").cast("timestamp"))
    .withColumn("passenger_count", F.col("passenger_count").cast("bigint"))
    .withColumn("trip_distance", F.col("trip_distance").cast("double"))
    .withColumn("RatecodeID", F.col("RatecodeID").cast("bigint"))
    .withColumn("store_and_fwd_flag", F.col("store_and_fwd_flag").cast("string"))
    .withColumn("PULocationID", F.col("PULocationID").cast("int"))
    .withColumn("DOLocationID", F.col("DOLocationID").cast("int"))
    .withColumn("payment_type", F.col("payment_type").cast("bigint"))
    .withColumn("fare_amount", F.col("fare_amount").cast("double"))
    .withColumn("extra", F.col("extra").cast("double"))
    .withColumn("mta_tax", F.col("mta_tax").cast("double"))
    .withColumn("tip_amount", F.col("tip_amount").cast("double"))
    .withColumn("tolls_amount", F.col("tolls_amount").cast("double"))
    .withColumn("improvement_surcharge", F.col("improvement_surcharge").cast("double"))
    .withColumn("total_amount", F.col("total_amount").cast("double"))
    .withColumn("congestion_surcharge", F.col("congestion_surcharge").cast("double"))
    .withColumn("Airport_fee", F.col("Airport_fee").cast("double"))
)

sdf_cast.writeTo("oci.tripdata.yellow").append()

Summary:

Combining the three parts of the script, filling in the credentials, endpoint, and bucket, and specifying the parquet object to load will populate object storage properly, including the manifest file.
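
Before moving on, a quick sanity check from the same Spark session confirms what was written. This is just a verification sketch; it uses Iceberg's built-in metadata tables, which are exposed under the table name.

# Row count of the Iceberg table that was just loaded
spark.sql("SELECT count(*) AS row_count FROM oci.tripdata.yellow").show()

# Snapshot history: one entry per commit (each append created a new snapshot)
spark.sql("""
  SELECT snapshot_id, committed_at, operation
  FROM oci.tripdata.yellow.snapshots
""").show(truncate=False)

# Data files currently referenced by the table, with their record counts
spark.sql("""
  SELECT file_path, record_count
  FROM oci.tripdata.yellow.files
""").show(truncate=False)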

Investigating the resulting manifest and objects

Below is a list of the objects created in the metadata directory.


You can see that I updated the data and added more data; each time, a new version of the manifest was created along with snapshot information.

When writing the data, I chose to partition it; below is the partitioned data layout. You can see that when Apache Iceberg wrote the data, it automatically created parquet objects in directories for each partition and added new data to the current partition's directory.
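
If you want to list those objects from a script rather than the OCI console, a small sketch with s3fs (already in the package list above) works against the same S3-compatible endpoint; the endpoint and keys below are placeholders.

import s3fs

# Placeholder credentials and endpoint; the same values used in the Spark configuration above
fs = s3fs.S3FileSystem(
    key="{my access key}",
    secret="{my secret for that key}",
    client_kwargs={"endpoint_url": "https://{my tenancy/namespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com"},
)

# Metadata objects: vN.metadata.json files, manifest lists, and manifests
print(fs.ls("icebergs3/iceberg_warehouse/tripdata/yellow/metadata"))

# Data objects: one directory per months(tpep_pickup_datetime) partition
print(fs.ls("icebergs3/iceberg_warehouse/tripdata/yellow/data"))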


Creating an Oracle table on top of Iceberg

Now that I have created my Iceberg table and it is stored in object storage, I can create an Oracle table that will read the manifest file.

In my example, the most recent manifest object is "iceberg_warehouse/tripdata/yellow/metadata/v7.metadata.json".

My External Table definition is below.
CREATE TABLE CUSTOMERS_ICEBERG
(
VendorID        integer,
tpep_pickup_datetime    timestamp,
tpep_dropoff_datetime    timestamp,
passenger_count        number,
trip_distance        number,
RatecodeID        number,
store_and_fwd_flag    blob,
PULocationID        number,
DOLocationID        number,
payment_type        number,
fare_amount        number,
extra            number,
mta_tax            number,
tip_amount        number,
tolls_amount        number,
improvement_surcharge    number,
total_amount        number,
congestion_surcharge    number,
Airport_fee        number
)
ORGANIZATION EXTERNAL
(
    TYPE ORACLE_BIGDATA
    DEFAULT DIRECTORY DATA_PUMP_DIR
    ACCESS PARAMETERS
    (
        com.oracle.bigdata.fileformat=parquet
        com.oracle.bigdata.credential.name=OCI_S3
        com.oracle.bigdata.access_protocol=iceberg
    )
   LOCATION ('iceberg:https://{namespace}.compat.objectstorage.us-ashburn-1.oci.customer-oci.com/icebergs3/iceberg_warehouse/tripdata/yellow/metadata/v7.metadata.json')
)
PARALLEL;

NOTE: the location URL is prefixed with "iceberg:https".

I can now select from my new table and it will read the Iceberg manifest file.
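
As a quick check, here is a minimal python-oracledb sketch that queries the new external table (the connection details are placeholders); the aggregation below simply groups trips by pickup month.

import oracledb

# Placeholder connection details for the 23ai database that owns CUSTOMERS_ICEBERG
conn = oracledb.connect(user="dw", password="{db password}", dsn="{db host}/{service name}")

with conn.cursor() as cur:
    cur.execute("""
        SELECT trunc(tpep_pickup_datetime, 'MM') AS pickup_month,
               count(*)                          AS trips,
               round(sum(total_amount), 2)       AS total_fares
          FROM customers_iceberg
         GROUP BY trunc(tpep_pickup_datetime, 'MM')
         ORDER BY pickup_month""")
    for row in cur:
        print(row)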

Conclusion:

Oracle Database 23ai not only supports creating external tables on top of Parquet objects, but it also supports creating external tables on top of Apache Iceberg manifest objects.





Wednesday, September 17, 2025

Building a Cyber Vault? Don't forget your keys

When building a cyber vault, one of the most important items to manage is encryption keys.  Encrypting your data is a fundamental pillar of ransomware protection, but encryption key management is often forgotten.  
Ensuring your cyber vault has a current copy of the encryption keys associated with your Oracle databases is critically important for a successful restore and recovery after an attack.

Starting with the Oracle Key Vault (OKV) 21.11 release, Oracle Key Vault includes a preview of the Key Transfer Across Oracle Key Vault Clusters Integration Accelerator. You can use this feature to transfer security objects from one OKV cluster to another.

You can find more detail on this feature here, and I will describe its benefits in this post.

The diagram below shows a typical cyber vault architecture to protect Oracle Databases with the Zero Data Loss Recovery Appliance (ZDLRA) and OKV.

Transferring keys into a cyber vault with OKV

Encryption Key Architecture

Encryption key management is a critical piece of data protection, and it is important to properly manage your keys.  Good cyber protection begins with proper encryption key management.

Local Wallets

With Oracle databases, the default (and simplest) location for encryption keys is in a locally managed wallet file. The keys are often stored in an auto-login wallet, which is automatically opened by the database at startup, making key management transparent and simple, but not very secure.

Why aren't wallets secure?

  • They are often auto-login (cwallet.sso), which allows the database to open the wallet without requiring a password. This wallet file gives full access to the encryption keys.
  • The wallet file is stored with the database files. A privileged administrator, such as a DBA, has access to both the database and the keys to decrypt the data directly from the hosts. They also have the ability to delete the wallet file.
  • Often the wallet file is backed up with the database, and this backup includes the auto-login wallet. This allows anyone who has access to the backups to also decrypt the data.
  • Securely backing up the wallet file separately from the database is often forgotten, especially when ASM is used as the wallet location. Not having the wallet file when restoring the database makes restoration and recovery impossible.

Steps you can take

  • Create both a passworded wallet (ewallet.p12) and a local auto-login wallet (cwallet.sso). With a local auto-login wallet, the wallet can only be opened on the host where the wallet was created.
  • Back up only the passworded wallet (ewallet.p12). The auto-login wallet can always be recreated from the passworded wallet.
  • Properly store the password for your passworded encryption wallet. You will need the password to rotate your encryption keys and to recreate the auto-login wallet. (A sketch of creating these wallets follows this list.)
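
As an illustration only, the two wallet artifacts described above are typically created with ADMINISTER KEY MANAGEMENT. The sketch below issues those statements through python-oracledb; the paths, passwords, and connection details are placeholders, and the exact statements depend on your TDE configuration (WALLET_ROOT, CDB vs. PDB, and so on).

import oracledb

# Placeholder connection; requires a user with the SYSKM or ADMINISTER KEY MANAGEMENT privilege
conn = oracledb.connect(user="c##sec_admin", password="{db password}",
                        dsn="{db host}/{service name}", mode=oracledb.AUTH_MODE_SYSKM)

with conn.cursor() as cur:
    # Passworded keystore (ewallet.p12) -- this is the file to back up
    cur.execute("""
        ADMINISTER KEY MANAGEMENT CREATE KEYSTORE '/etc/ORACLE/WALLETS/orcl'
        IDENTIFIED BY "{wallet password}" """)

    # Local auto-login keystore (cwallet.sso), usable only on the host where it was created
    cur.execute("""
        ADMINISTER KEY MANAGEMENT CREATE LOCAL AUTO_LOGIN KEYSTORE
        FROM KEYSTORE '/etc/ORACLE/WALLETS/orcl'
        IDENTIFIED BY "{wallet password}" """)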

Oracle Key Vault (OKV)

 The best way to securely manage encryption keys is with OKV.

Why?

  • Keys are managed outside of the database and cannot be accessed locally on the database host.
  • Access to keys is granted to a specific database instance.
  • OKV is clustered for High Availability, and the OKV cluster can be securely backed up.
  • Key access is audited, allowing access to be detected early.

Encryption Keys in a cyber vault architecture

Below is a diagram of a typical cyber vault architecture using ZDLRA. Because the backups are encrypted, whether the databases are using TDE, creating RMAN-encrypted backups sent to the ZDLRA, or both, the keys also need to be available in the vault.
Not only is the cyber vault a separate, independent database and backup architecture, the vault also contains a separate, independent OKV cluster.
This isolates the cyber vault from any attack on the primary datacenter, including any attack that could compromise encryption key availability.


Encryption Keys in an advanced cyber vault architecture

Below is a diagram of an advanced cyber vault architecture using ZDLRA. Not only are the backups replicated to a separate ZDLRA in the vault, they are also internally replicated to an Isolated Recovery Environment (IRE). In this architecture, the recovery area is further isolated, and the OKV cluster is even further isolated from the primary datacenter. This provides the highest level of protection.


OKV Encryption Key Transfer


This blog post highlights the benefit of the newly released (21.11) OKV feature that allows for the secure transfer of encryption keys.
Periodic rotation of encryption keys is a required practice for protecting them, and ensuring the current keys are available in a cyber vault is challenging.
OKV solves this challenge by providing the ability to transfer any new or changed keys between clusters.

Implementing OKV in a cyber vault

When building a cyber vault, it is recommended to build an independent OKV cluster. The OKV cluster in the vault is isolated from the primary datacenter and protected by an air gap. The nodes in this cluster are not able to communicate with the OKV cluster outside of the vault.
The OKV cluster in the vault can be created using a full, secure backup from the OKV cluster in the primary datacenter. The backup can be transferred into the vault and then restored to the new, independent OKV cluster, providing a current copy of the encryption keys.

Keeping OKV-managed keys updated in a cyber vault

The challenge, once an isolated OKV cluster has been created, has been keeping the encryption keys within the cluster current as new keys are created. This was typically accomplished by transferring a full backup of OKV into the vault and rebuilding the cluster using that backup.

OKV 21.11 provides the solution with secure Encryption Key Transfer. Leveraging this feature, you can securely transfer just the keys that have recently changed, allowing you to manage independent OKV clusters that are synchronized on a regular basis.

The diagram below shows the flow of the secure Encryption Key Transfer package from the primary OKV cluster into the vault when the air-gap is opened.

This new OKV feature provides a much better way to securely manage encryption keys in a Cyber Vault.



Summary

As ransomware attacks increase, it is critical to protect a backup copy of your critical databases in a cyber vault. It is also critical to protect a copy of your encryption keys to ensure you can recover those databases. OKV provides the architecture for key management in both your primary datacenter and in a cyber vault. The new secure key transfer feature within OKV allows you to synchronize keys across independent OKV clusters.