Learnable function

From Wikipedia, the free encyclopedia

A learnable function is a mathematical function that can be learned from data, typically through a machine learning algorithm, to minimize errors and perform a specific task. In the context of statistical learning theory, the term refers to a function class in which an algorithm can identify a specific function that best approximates the underlying pattern or distribution within the data.[1]

In the domain of database systems, specifically within the emerging field of AI-powered database systems, the learnable function is a general concept representing a paradigm shift in which traditional, hard-coded system heuristics are replaced by parameterized functions. This enables the DBMS to learn, adapt, and govern itself based on the data it manages. Instead of relying on static rules designed by human experts, database components use learnable functions to adapt dynamically to specific data distributions and workloads.

Definition

Fundamentally, a learnable function can be formalized as a mapping $f_\theta : \mathcal{X} \to \mathcal{Y}$, parameterized by a set of weights $\theta$. The goal of the learning process is to find the optimal parameters $\theta^*$ that minimize a specific loss function $L$ over a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$:

$$\theta^* = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L\big(f_\theta(x_i),\, y_i\big)$$

In this framework:

  • $x$ represents the input features (covariates).
  • $y$ represents the target output or label.
  • $f_\theta$ is the hypothesis or model (e.g., a linear regression model, a decision tree, or a neural network).
  • The "learnability" of the function ensures that as the size of the dataset increases, the empirical risk converges to the true expected risk, allowing the function to generalize to unseen data. A minimal worked sketch of this optimization follows the list.

Formulation in AI-powered Database Systems

In traditional database design, system decisions, such as which index to use, how to order joins, or how to handle transaction conflicts, are governed by static functions, denoted here as $f_{\text{static}}$. These are typically composed of "if-then" rules and cost models based on general assumptions (e.g., uniform data distribution) established by human engineers.

In the context of AI-powered database systems, $f_{\text{static}}$ is replaced by a learnable function $f_\theta$. This reformulation treats database internals as approximation problems (a minimal sketch follows the list below):

  • Input ($x$): Database states, query syntax, or runtime statistics, e.g., lock contention levels, data cardinality.
  • Output ($y$): System actions or estimates, e.g., estimated selectivity, optimal concurrency action.
  • Optimization: The parameters $\theta$ are tuned via methods such as supervised learning (using query logs) or reinforcement learning (using runtime feedback).
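The interface change from $f_{\text{static}}$ to $f_\theta$ can be sketched in a few lines of Python. The feature names, constants, and training pairs below are hypothetical and only illustrate how a hard-coded cost rule and a fitted model expose the same mapping; real systems fit far richer models from query logs.

```python
import numpy as np

# Static heuristic f_static: a hand-written cost rule with fixed constants
# chosen by a human designer (hypothetical formula for illustration).
def f_static(rows_scanned: float, selectivity: float) -> float:
    return 0.01 * rows_scanned * selectivity

# Learnable replacement f_theta: same interface, but the parameters theta
# are fitted from observed executions instead of being hard-coded.
class LearnedCost:
    def __init__(self):
        self.theta = np.zeros(3)  # [bias, weight_rows, weight_selectivity]

    def predict(self, rows_scanned: float, selectivity: float) -> float:
        x = np.array([1.0, rows_scanned, selectivity])
        return float(x @ self.theta)

    def fit(self, features, latencies):
        # Least-squares fit of theta to observed (features, latency) pairs.
        A = np.column_stack([np.ones(len(features)), features])
        self.theta, *_ = np.linalg.lstsq(A, latencies, rcond=None)

# Hypothetical training data from query logs: [rows_scanned, selectivity] -> latency (ms).
features = np.array([[1e4, 0.1], [5e4, 0.5], [2e5, 0.05], [1e6, 0.9]])
latencies = np.array([3.1, 41.0, 18.5, 950.0])

model = LearnedCost()
model.fit(features, latencies)
print("learned estimate:", model.predict(1e5, 0.3))
print("static estimate: ", f_static(1e5, 0.3))
```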

Implementations in Database Systems

The implementation of learnable functions varies based on the complexity of $f_\theta$ and the inference latency requirements.

  • Lookup Tables / Discrete Mappings: For low-latency requirements, e.g., concurrency control, $f_\theta$ may be implemented as a learned lookup table in which the input $x$ is mapped to discrete buckets and $\theta$ represents the actions stored in those buckets (see the sketch after this list).
  • Classical ML Models: Linear models and tree-based models (e.g., XGBoost) are often used for cost estimation, where interpretability and training speed must be balanced.
  • Deep Learning: Neural networks are employed for high-dimensional mappings, such as cardinality estimation over complex joins, where $x$ is a vector embedding of the SQL query.
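The lookup-table implementation can be illustrated with the following sketch, which discretizes a single contention feature into buckets and stores one learned action per bucket. The feature, bucket boundaries, action names, and feedback tuples are all hypothetical.

```python
import numpy as np

# Hypothetical discrete actions a concurrency-control policy could take.
ACTIONS = ["wait", "abort", "optimistic_validate"]

class LearnedLookupTable:
    """f_theta as a bucketed lookup table: input x -> bucket -> stored action."""

    def __init__(self, boundaries):
        self.boundaries = np.asarray(boundaries)       # bucket edges over the feature
        self.table = ["wait"] * (len(boundaries) + 1)  # theta: one action per bucket

    def bucket(self, contention: float) -> int:
        return int(np.searchsorted(self.boundaries, contention))

    def predict(self, contention: float) -> str:
        # Inference is a single array lookup, keeping latency negligible.
        return self.table[self.bucket(contention)]

    def fit(self, observations):
        # observations: (contention, action, reward) tuples from runtime feedback.
        # Store the action with the best average reward in each bucket.
        sums, counts = {}, {}
        for contention, action, reward in observations:
            key = (self.bucket(contention), action)
            sums[key] = sums.get(key, 0.0) + reward
            counts[key] = counts.get(key, 0) + 1
        for b in range(len(self.table)):
            candidates = [(sums[(b, a)] / counts[(b, a)], a)
                          for a in ACTIONS if (b, a) in counts]
            if candidates:
                self.table[b] = max(candidates)[1]

# Hypothetical feedback: optimism pays off under low contention, waiting under high.
obs = [(0.1, "optimistic_validate", 1.0), (0.2, "wait", 0.4),
       (0.8, "optimistic_validate", 0.1), (0.9, "wait", 0.9)]
policy = LearnedLookupTable(boundaries=[0.33, 0.66])
policy.fit(obs)
print(policy.predict(0.05), policy.predict(0.95))
```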

Applications in Learnable Database Components

Learned Concurrency Control

Concurrency Control (CC) algorithms ensure transaction isolation. Traditional algorithms like Two-phase locking (2PL) or Optimistic concurrency control (OCC) perform well in specific scenarios but fail to generalize. Recent research proposes treating concurrency control as a learnable function.[2] In such a model, the CC policy can be regarded as an agent function:

  • $x$ (State): Features such as data hotness, dependency graph depth, and operation types.
  • $y$ (Action): A combination of conflict detection (e.g., validate read set) and resolution mechanisms (e.g., wait, abort, commit).

Unlike "Classify-then-Assign" approaches which merely select between 2PL and OCC, a learnable function approach can generate novel hybrid policies by learning the optimal action for a specific state via Bayesian optimization or generic search, effectively creating a "lookup table" of optimal behaviors that adapts to contention levels.

Learned Query Optimization

In query optimization, the learnable function typically replaces the cost model or the cardinality estimator.

  • Cardinality Estimation: $f_\theta : \text{query predicates} \mapsto \text{estimated row count}$. Deep learning models can capture correlations between columns that histograms (independence assumption) fail to model.
  • Cost Modeling: $f_\theta : \text{physical plan} \mapsto \text{latency}$. Learning the latency of physical operators on specific hardware.

The primary advantage of substituting these components with learnable functions is the reduction of estimation errors that propagate during plan enumeration. While traditional optimizers often produce sub-optimal plans due to inaccurate independence assumptions or outdated statistics, learned approaches can optimize for the true runtime metric, e.g., latency, by leveraging historical execution feedback to correct their internal models over time.
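The cardinality-estimation case can be sketched as a small regression on log-scaled row counts. The query featurization (per-predicate selectivities) and the training pairs below are hypothetical; production-grade learned estimators use much richer query encodings and models.

```python
import numpy as np

# Hypothetical featurization: each query becomes a vector of per-predicate
# selectivities; the target is the result cardinality observed in the log.
train_features = np.array([
    [0.9, 0.8],   # two weakly selective predicates
    [0.1, 0.9],
    [0.05, 0.1],  # two highly selective, correlated predicates
    [0.5, 0.5],
])
observed_cardinalities = np.array([700_000.0, 90_000.0, 30_000.0, 260_000.0])

# Fit theta by least squares on log(cardinality); the log scale keeps the
# loss from being dominated by the largest results.
A = np.column_stack([np.ones(len(train_features)), train_features])
theta, *_ = np.linalg.lstsq(A, np.log(observed_cardinalities), rcond=None)

def estimate_cardinality(selectivities):
    x = np.concatenate([[1.0], selectivities])
    return float(np.exp(x @ theta))

# Unlike the independence assumption (sel1 * sel2 * table_size), the fitted
# model reflects whatever column correlation is present in the training log.
print(round(estimate_cardinality([0.05, 0.1])))
```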

Theoretical Perspectives

Learnability vs. Complexity Trade-offs

While learnable functions offer adaptability, they are subject to the No free lunch theorem. A function class that is too complex, e.g., a large neural network, may overfit to a specific workload history and fail to generalize when the workload drifts (the variance problem). In contrast, a class that is too simple, e.g., linear regression, may fail to capture the non-linear performance characteristics of modern hardware (the bias problem).

This trade-off can be expressed mathematically in the decomposition of the excess risk of the learned function $f_\theta$ relative to the best possible predictor $f^*$, where $f_{\mathcal{H}}$ denotes the best function within the chosen hypothesis class $\mathcal{H}$:

$$R(f_\theta) - R(f^*) = \underbrace{\big(R(f_{\mathcal{H}}) - R(f^*)\big)}_{\text{approximation error}} + \underbrace{\big(R(f_\theta) - R(f_{\mathcal{H}})\big)}_{\text{estimation error}}$$

Effective AI-powered database systems must balance the approximation error (reduced by richer models such as neural networks) against the estimation error (which demands large amounts of training data and training stability).
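The bias/variance tension can be demonstrated with a toy comparison, assuming a synthetic non-linear latency curve and polynomial models of two different capacities; all data and model choices here are illustrative rather than drawn from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hardware" behavior: non-linear latency as a function of input size,
# observed with noise (a stand-in for a noisy workload history).
def true_latency(x):
    return 0.5 * x + 2.0 * np.sin(x)

x_train = rng.uniform(0, 6, size=15)
y_train = true_latency(x_train) + rng.normal(0, 0.3, size=15)
x_test = np.linspace(0, 6, 100)
y_test = true_latency(x_test)

for degree in (1, 9):  # too simple (bias) vs. too flexible (variance)
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```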

The Bitter Lesson and Scalability

The move toward learnable functions in databases can be discussed from the perspective of "The Bitter Lesson" by Richard Sutton,[3] which argues that general methods leveraging computation (search and learning) ultimately outperform methods that rely on encoded human knowledge (heuristics). In databases, a handcrafted cost model (human knowledge) is limited by the designer's foresight. A learnable function, by contrast, scales with the availability of computation and training data (query logs), and in principle allows the database to asymptotically approach the performance of an optimal system, often referred to as an oracle machine with perfect information.

References

  1. ^ Vapnik, Vladimir N. (1995). The Nature of Statistical Learning Theory. Springer.
  2. ^ Pan, Hexiang; Cai, Shaofeng; Dinh, Tien Tuan Anh; Wu, Yuncheng; Chee, Yeow Meng; Chen, Gang; Ooi, Beng Chin (2025). CCaaLF: Concurrency Control as a Learnable Function. arXiv:2503.10036.
  3. ^ Sutton, Richard (March 13, 2019). "The Bitter Lesson". Retrieved 2025-12-02.