1 Introduction
Predicting the dynamic electronic properties of a molecular system is essential to understanding phenomena such as charge transfer and response to an applied laser field. The timedependent Schrödinger equation (TDSE) governs the time evolution of a quantum electronic system. Using the timedependent density operator within a finitedimensional basis yields the Liouvillevon Neumann equation:
(1) 
Here and denote the timedependent electron density and Hamiltonian matrices in orthonormal bases, respectively, and the square brackets denote a commutator: for any square matrices and , the commutator is .
The manybody problem given by (1) can only be solved for simple systems, such as those with very few electrons within a small basis. HartreeFock (HF) theory is a simplified mean field approach in which the manybody wave function is approximated using an antisymmetrized product of single particle orbitals. Applying this approximation to the Hamiltonian for (1) produces twoelectron terms that are given by Coulomb and exchange operators, and thus a Hamiltonian that is now density dependent . Using this HF Hamiltonian, sometimes called the Fock matrix within HF theory, with (1) yields the timedependent HF (TDHF) equation, which, along with timedependent density functional theory (TDDFT), is often used for simulating electron dynamics,
(2) 
Here, using TDHF training data, we address the problem of learning the fieldfree Hamiltonian matrix from time series observations of electron densities . For the fieldfree trajectory, i.e., when the Hamiltonian contains no explicit timedependence, is a complex Hermitian matrix function of , which is also complex and Hermitian. Therefore, and
are completely determined by their upper triangular elements. Both matrices can be represented by vectors that contain the real and imaginary components of their upper triangular parts. Using these vector representations for
and , we develop a statistical model for . This model is linear in its parameters ; in the vector representation, the model Hamiltonian is also a linear function of electron density matrix elements.To fit the model, we minimize a loss function that measures the squared Frobenius norm between the left and righthand sides of (
2), evaluated on training data. This data consists of time series of electron density matrices and their timederivatives computed via centered differencing. The loss function depends on its parameters through the Hamiltonian. Since we use a linear model for the Hamiltonian, our loss function is quadratic in the model parameters . Therefore, to minimize the loss and fit the model, we must solve a least squares problem. Equipped with the Hessian and gradient of the loss function, the solution to this problem reduces to that of . For small systems, we can carry this out effectively, using automatic differentiation to compute and .However, this approach does not scale well to larger molecular systems and results in prohibitively large training times, the majority of which is required for computation of the Hessian matrix. To address this, we develop a data science framework that scales to larger molecules and larger basis sets than in our previous work
[2]. Here, we use dimensionalityreduction techniques based on degrees of freedom in the density matrix and properties of the HF Hamiltonian. Another challenge for large systems with symmetry is that the Hamiltonian model does not extrapolate well to the fieldon case because the Hessian matrix has
eigenvalues, leading to multicollinearity. To resolve this challenge we use ridge regression. Ridge regression places a constraint on the model parameters by adding a penalty to the loss function.To train the Hamiltonian model we use time series of density matrices generated with no external perturbations. Using the learned Hamiltonian, we propagate forward in time to obtain a fieldfree trajectory. To compute a fieldon trajectory, we add a timedependent external perturbation to the learned Hamiltonian and propagate forward in time. We find that the learned fieldfree Hamiltonian can be used to propagate electron dynamics in both fieldfree and fieldon conditions, yielding results that closely match those obtained via ground truth Hamiltonians.
Our overarching goal is to learn a potential/Hamiltonian for TDDFT to simulate more accurate electron dynamics. Key to this theory is the introduction of a density dependent exchange correlation potential that accounts for quantum electronelectron manybody Coulombic interactions not captured from the classical (mean field) Coulomb contribution. However, the exact form of the exchange correlation portion of the Hamiltonian is unknown. Therefore, our goal is to first develop a method to learn a known, more approximate densitydependent Hamiltonian, like that used in timedependent HartreeFock (TDHF) theory [9, 10, 8, 6]
. This work provides the methodological development for a framework that seeks to model the Hamiltonian and use it to predict the dynamics of the system. This work sets us on a pathway towards developing a novel statistical/machine learning method for more complex theories for predicting electron dynamics.
2 Methods
2.1 Generating Data
In this paper, we predict electron dynamics for six molecular systems: in the 631G basis set (two electrons in 4 basis functions), in the 631G and 6311++G basis sets (two electrons in 4 and 14 basis functions, respectively), LiH in the 631G and 6311++G basis sets (four electrons in 11 and 29 basis functions, respectively), and in the STO3G basis set (16 electrons in 14 basis functions). Note that each molecular orbital created from a linear combination of these atomic orbital basis functions is doubly occupied. We build off of our previous work that developed models for the simpler systems, , and LiH in the STO3G basis set (two electrons and two basis functions) [2].
For each molecular system, we apply standard electronic structure methods to compute the ground truth fieldfree Hamiltonian/Fock matrix and variationally determine the ground state electron density matrix . Our initial condition at is either fieldfree with determined from solving for the electron density in the presence of an applied electric field (a deltakick perturbation at ), or is the ground state electron density and we apply the field during propagation (see below). We then numerically solve (2) to generate an electron dynamics trajectory , recording the data at temporal resolution a.u., propagating with the modified midpoint unitary transformation method [7, 11]. These steps were performed using a modified version of the Gaussian electronic structure code [4]. We generate two data sets for each molecular system:

Fieldfree trajectory: The initial density matrix in the presence of an electric field is calculated. Using this initial condition as the delta kick perturbation and then without applying any external perturbation during propagation, a trajectory is produced. A part of this trajectory, i.e., density matrices where , is used for training and another part of this trajectory for is used as a validation set.

Fieldon trajectory: The initial ground state density matrix without any perturbation is calculated. An external forcing term is applied during propagation, where is the applied electric field in the z direction (along the main molecular bond axis), is the electric field frequency, and is the
component of the molecular dipole moment. For this study, the electric field is turned on for one cycle (
= ) at , with a.u (an offresonant frequency corresponding to the neodymiumYAG laser) and a.u. We test our learned Hamiltonian against this fieldon trajectory; fieldon trajectories are never used during the training process.
2.2 Statistical Learning
Our aim is to learn the molecular Hamiltonian , which is a Hermitian matrixvalued function of the Hermitian density matrix as in (2). Since and are Hermitian, they are completely determined by their uppertriangular components. We split and into real and imaginary matrices and then flatten and combine the uppertriangular parts of each matrix into corresponding real vectors. Let , denote real column vectors that contain the real and imaginary parts of the uppertriangular portions of the complex matrices , . Let tildes denote statistical models—to be clear, is the model Hamiltonian, different from the true Hamiltonian . As in [2], we use a linear model and squared loss
(3)  
(4) 
where , , , and . The loss function quantifies the mismatch between the left and righthand sides of (2), with the timederivative approximated by a centereddifference quotient. To train, we compute that minimizes on the training data:
(5) 
This is a least squares problem. Let denote the Hessian of the loss with respect to . Let , the gradient of the loss with respect to , evaluated at . To solve (5), we can take the gradient of the loss function and set it to 0. This results in the normal equations, which we can write in terms of the Hessian and gradient of the loss (see Appendix A for details):
(6) 
We briefly explain the meaning of the loss by asking the hypothetical question: if refers to ground truth electron density matrices in our training data, what does it mean for the loss function to vanish? Consider the following equation, which defines a onestep prediction of :
(7) 
For to vanish, for each , we must be able to insert the true values of and into the righthand side of (7) and obtain a predicted that perfectly matches the true . In short, the loss measures the deviation from perfect onestep or local prediction via (7), across the entire training time series. We use the loss as a proxy for the true metric of interest, which is longterm propagation error (12). Direct or adjointbased minimization of (12) can in principle be used to solve for ; however, this will be much more computationally expensive than our approach.
As described above, the training data consists of fieldfree trajectories. For each molecular system, we train using time series of density matrices where obtained from the fieldfree trajectory to ensure that this learned Hamiltonian does not depend on an external field. We do not use the first two time steps of the trajectory since these time steps have large values of , a consequence of the deltakick initial condition.
The solution to (6
) results in the statistical estimates
and the molecular Hamiltonian can then be determined using (3). We tested the model for small molecules in small basis sets (up to 6 6 in dimension for the complex Hamiltonian). When we sought to extend this approach to more complex molecular systems, we encountered two main problems: (i) training times were unacceptably large due to automatic differentiation, and (ii) propagation results were inaccurate. To solve (i), we coded the gradient and Hessian of the loss (4) ourselves, leveraging parallelization—see Appendix B for details. To solve (ii), we applied dimensionality reduction and ridge regression, which we now detail in turn.2.2.1 Dimensionality Reduction
We consider diatomic molecules in the 631G and 6311++G** bases and the larger molecule in the small STO3G basis. Let denote the dimension of the density and Hamiltonian matrices for each molecule in a given basis set. For larger basis sets or larger molecules, increases dramatically; see Table 1. Our initial implementation leads to a naïve version of (3) in which is of size and hence is an matrix of regression coefficients. We employ two tactics to reduce the dimensionality of . First, we split (3) into two separate models, such that the parts of that correspond to real (respectively, imaginary) components of depend only on the real (respectively, imaginary) components of . This splitting, which can be justified based on physical properties of the HartreeFock Hamiltonian, was not present in our prior work [2]. At time , the true fieldfree Hamiltonian in the AO basis is,
(8) 
Here is the kinetic energy matrix, is the electronnuclear energy matrix, and is the density dependent combination of Coulomb and exchange matrices. Let , then for ,
(9) 
where
is a fourindex tensor in the Coulomb and exchange calculations. Because this tensor is real, the real elements of the Hamiltonian depend on the real elements of the density matrix and the imaginary elements of the Hamiltonian depend on the imaginary elements of the density matrix. The second tactic used to reduce dimensionality is that when forming the flattened vector representation
, we retain only those entries of where the corresponding entries of are not identically zero [2]. For these linear or flat molecular systems, elements are identically zero due to the molecular symmetry, e.g. if they are constructed from orthogonal basis functions. In this way, for the largest problem under consideration, we reduce from to , reducing the number of coefficients by a factor .2.2.2 Ridge Regression
When we scale our method to molecular systems with large , we also notice multicollinearity, e.g., numerous zero eigenvalues in the Hessian of the loss . With multicollinear data, the least squares estimator predicts poorly. We eliminate this problem by using ridge regression, for which we can write the penalized loss function as ; note the use of the norm, as opposed to the norm in the penalty term for Lasso, i.e., [5]. In this work, we train our model by computing the ridge regression solution:
(10) 
where is the Hessian of with respect to and is the gradient of with respect to computed at . For a grid of values, we compute on the training set, and then compute the loss on a validation set that is disjoint from but equal in size to the training set. Figure 1 shows the validation loss for different values for in the STO3G basis and in the 6311G** basis set. We choose that minimizes the validation set loss.
One might conclude incorrectly from Figure 1 that, as the range of values on the vertical axes is small (multiplied by in the left panel and on the right), choosing a nonoptimal may not affect final results. In practice, we find that fieldon propagation results improve considerably if we choose the optimal . We hypothesize that this occurs for two reasons. First, the loss function essentially measures local or onestep propagation error, as described in Section 2.2. Second, we note that (2
) is a nonlinear system of ordinary differential equations; the righthand side is quadratic in the elements of
. Nonlinearity can magnify errors in the estimated Hamiltonian. Over thousands of time steps, these errors can accumulate and cause predicted trajectories to diverge substantially from reality. We also explored Lasso, but chose ridge regression due to its superior performance.3 Results
Applying the training procedure described in Section 2 to the molecular systems listed in Table 1, we learn and determine . Here, for smaller molecular systems, we train using time series with points. For larger systems, we increase the training set size; we determine the number of points by computing a learning curve, plotting test set propagation error against the number of training points. Let us illustrate the effect of training set size for two of the larger systems studied here: in 6311++G** and the largest molecular system, LiH in 6311++G. Figure 2 shows that, as we increase the training set size, fieldon propagation error decreases (blue) while computational time for training increases (gray). The training set size used for each system is given in 1.
For propagation, we use RK45 ([3]) to solve (2) numerically with the learned Hamiltonian for 2000 steps. We do this both for the case of a delta kick perturbation (the same as the training data, a fieldoff perturbation) and for the case of a sinusoidal electric field perturbation (a fieldon perturbation). The fieldon perturbation tests the learned Hamiltonian in a regime that is outside that of the training set.
Molecule  Basis set  Training Set size  Training Loss  fieldfree error  fieldon error  

631G  
631G  
LiH  631G  
STO3G  
6311++G**  
LiH  6311++G** 
The training loss, fieldfree, and fieldon propagation error for six molecular systems are presented in Table 1. Training loss reported here is calculated as using (4). This training loss measures the squared Frobenius norm of one step errors, i.e, the error in propagating to the next time step using the learned Hamiltonian via (7). The small values of the fieldfree error, for all molecules, indicate that the Hamiltonian learned by minimizing (4) can be used for longterm propagation. Even with an applied field, which is outside the training regime, we obtain propagation errors comparable to if not less than those in the fieldfree case, implying that the learned Hamiltonian generalizes well beyond the training regime.
Let denote the prediction, i.e, density matrix obtained by propagating the learned Hamiltonian. We define the timedependent propagation error as
(11) 
where measures the error (at time ) between , the predicted trajectory obtained by propagating the learned Hamiltonian, and , the ground truth trajectory. We calculate the mean propagation error for the propagation interval as
(12) 
where is the number of time steps for which we propagate the Hamiltonian. For this study, . In Fig. 3 we plot the timedependent propagation errors for all molecular systems in both the fieldfree and fieldon cases. We see that that errors for both cases remain reasonably small for all molecular systems even after propagating for a.u., which is equivalent to time steps.
In Fig. 4, we plot, as a function of time, selected nonzero elements of the density matrix obtained by propagating the learned Hamiltonian (red), and the ground truth (blue) obtained from a widelyused electronic structure code (see details in Section 2). We observe good agreement between predicted and ground truth trajectories.
4 Discusssion
In this work, we extended our prior methodology by incorporating dimensionality reduction (in the form of realimaginary splitting) and ridge regression. Using these techniques, we addressed challenges in scaling our method to molecular systems with larger basis set size . Using the learned Hamiltonian, we can predict electron densities for not only the training set (field free) but also for the test set (field on). The loss function (4) measures the sum of squares of onestep propagation errors, a form of local error. By minimizing this loss over the training set, we obtain very good longtime propagation error. For some molecular systems, the agreement is to a degree that we cannot tell the two curves (propagation using learned Hamiltonian and the ground truth trajectory) apart.
We used two dimensionality reduction techniques to significantly reduce the number of model parameters: (i) splitting the Hamiltonian model based on properties of the HF Hamiltonian and (ii) modeling only nonzero elements of the Hamiltonian matrix. The effective number of degrees of freedom in the Hamiltonian is less than the number of nonzero elements due to the linear combinations of Hamiltonian elements that expressed through (2). This reduction can be easily observed for smaller molecular systems like in the STO3G basis set. For larger molecular systems, these linear dependencies are much more prevalent and more difficult to verify directly. In such cases, regularization improves the prediction capability of a model by decreasing the number of degrees of freedom. Here, using ridge regression we successfully reduced the fieldon propagation error.
We also coded the Hessian and gradient for the loss function instead of using automatic differentiation techniques, thus making it feasible to obtain the least squares solution for larger molecular systems. Although for most molecular systems we used one fieldfree trajectory with time steps for training, for larger systems such as LiH (), we increased the training set size and observed that the fieldon propagation error decreases for a larger training set. However, as we increase the training set size, the computational time for training increases, eventually becoming prohibitively expensive for a training set with more than time steps. In the future, we hope to extend this model to even larger molecular systems, and also learn a densitydependent Hamiltonian based on more accurate wave function generated densities.
Acknowledgments
This work was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences under Award Number DESC0020203. We acknowledge computational time on the MERCED cluster (funded by NSF ACI1429783), and on the Nautilus cluster, which is supported by the Pacific Research Platform (NSF ACI1541349), CHASECI (NSF CNS1730158), and Towards a National Research Platform (NSF OAC1826967). Additional funding for Nautilus has been supplied by the University of California Office of the President.
Data Availability Statement
All code required to reproduce all training and test results is available on GitHub at https: //github.com/hbhat4000/electrondynamics [1]. Training data is available from the authors upon request.
Financial disclosure
None reported.
Conflict of interest
The authors declare no potential conflict of interests.
Appendix A Reduction to Least Squares
We begin by writing the loss (4) as . To minimize , we need to determine
(13) 
We start by expanding the loss function:
To minimize the righthand side, we take the gradient with respect to ,
Setting this gradient to , we obtain the normal equations:
(14) 
Let denote the Hessian of . Since and , we can write (14) as
(15) 
To estimate the ridge regression solution, we need to compute
(16) 
Augmenting the loss with the ridge penalty yields
The gradient is then
Setting the gradient to , we get
(17) 
The Hessian of is . With this, we can write (17) as
(18) 
Appendix B Computation of the Gradient and Hessian
Here we describe the details behind our computation of the gradient and Hessian of the loss function (4). Let us introduce the notation to denote the th row and th column of the matrix . Similarly, let denote the th row and th column of the centereddifference time derivative . We let denote . Then, with denoting complex conjugate in this section, we can rewrite the loss (4) as
(19) 
Whereas we previously wrote , here we give more details. The term refers to an intercept matrix. However, refers to two collections of matrices, and . All matrices here are of the same dimension as . To better understand the roles of these matrices, let us note that depends only on certain nonzero, uppertriangular entries of . We let denote the indices of the real part of upon which we allow to depend. Similarly, we let denote the indices of the imaginary part of upon which we allow to depend. Hence and are, respectively, the total numbers of real and imaginary parts of that are active in the model for .
With this notation, we can write our linear model for as follows—note that we begin with the uppertriangular part: for ,
(20) 
For , because is Hermitian (or selfadjoint), we have
(21) 
Here we have used the fact that and are both real—this is necessary for the real (respectively, imaginary) part of to depend only on the real (respectively, imaginary) part of . Note that in these expressions, we only use the uppertriangular parts of and .
We focus first on the gradient and Hessian of with respect to . From (2021), we see that depends on only through . For any integers and , we define the Kronecker delta . Then
Observe that is of the form . Putting these pieces together, we obtain, with signifying real part,
(22) 
In what follows, we use to denote the indicator function of the set , e.g., . We then compute
Hence
This implies that
(23a)  
(23b) 
This is the gradient of the loss with respect to each of the matrices. In our code, we parallelize this computation across the and indices. More specifically, we implement this calculation via a function that, for a given and , computes for all at once. We then evaluate this function in parallel across all indices ; as mentioned above, the lowertriangular parts of the , , and matrices play no role in our model for .
Examining the form of the model (2021), we note that upon exchanging
(24) 
the roles of and become reversed. Using this fact, we can extract from the above calculation an expression for the gradient : we simply apply the transformation (24) to (23). We have verified by hand that this yields precisely the same result as differentiating directly with respect to .
Further examining (2021), we see that if we set and , then plays the same role as . Setting in (23) gives us the gradient ; again, we have verified this by hand. With this, we have described the full computation of the gradient of with respect to all model parameters.
To begin our calculation of the Hessian, we take a second derivative on both sides of (22) to obtain
The product here yields different terms inside the sum. Through algebra analogous to that used to derive (23), we can compute each of these terms and sum them over all and . The resulting terms give us a closedform expression for . When we implement the Hessian in code, we first develop a function that takes as input fixed values of , , , and , returning as output the partial derivative for all values of and at once. We then evaluate this function in parallel over all possible values of and , again taking into account the fact that only the uppertriangular part of matters. In this way, we compute the central block in the overall Hessian:
(25) 
The calculation of the block can be recycled and converted into calculations of all other blocks. For instance, applying the transformation (24) to in the final expression of the block gives us, by symmetry, both the and blocks. If we then go back and apply the transformation (24)to both and in the final expression of the block, we obtain the block. Similarly, setting either or both of to yields the blocks in the first row and first column of (25).
Through these strategies, we compute all entries of the gradient and Hessian of without recourse to automatic differentiation, which we relied upon in our earlier work [2]. While automatic differentiation yields perfectly accurate gradients and Hessians for small molecular systems, as the system size grows larger, we find that the computational cost of automatic differentiation increases considerably until it becomes unusable. Simultaneously, we find that the parallel computation of analytically derived gradients and Hessians, via the techniques described here, scales well to all molecular systems described in this paper.
References
 [1] (2021) Electron Dynamics. Note: https://github.com/hbhat4000/electrondynamics Cited by: Data Availability Statement.
 [2] (2020) Machine learning a molecular Hamiltonian for predicting electron dynamics. International Journal of Dynamics and Control 8 (4), pp. 1089–1101. Cited by: Appendix B, §1, §2.1, §2.2.1, §2.2.
 [3] (1980) A family of embedded RungeKutta formulae. Journal of Computational and Applied Mathematics 6 (1), pp. 19–26. External Links: ISSN 03770427, Document Cited by: §3.
 [4] (2018) Gaussian Development Version Revision I.14+. Note: Gaussian Inc. Wallingford CT Cited by: §2.1.
 [5] (2001) The Elements of Statistical Learning. Springer, New York, NY, USA. Cited by: §2.2.2.
 [6] (2020) Realtime timedependent electronic structure theory. Chemical Reviews 120 (18), pp. 9951–9993. Note: PMID: 32813506 External Links: Document Cited by: §1.
 [7] (2005) A timedependent HartreeFock approach for studying the electronic optical response of molecules in intense fields. Physical Chemistry Chemical Physics 7 (2), pp. 233–239. External Links: Document, ISSN 14639076 Cited by: §2.1.
 [8] (2016) Perspective: Fundamental aspects of timedependent density functional theory. JCP 144 (22), pp. 220901. External Links: Document Cited by: §1.
 [9] (2012) Fundamentals of TimeDependent Density Functional Theory. SpringerVerlag. Cited by: §1.
 [10] (2016) Electron dynamics with realtime timedependent density functional theory. IJQC 116 (10), pp. 739–749. External Links: Document Cited by: §1.
 [11] (2007) Electronic optical response of molecules in intense fields: Comparison of TDHF, TDCIS, and TDCIS(D) approaches. Journal of Chemical Physics 126 (24), pp. 1–13. External Links: Document, ISSN 00219606 Cited by: §2.1.
Comments
There are no comments yet.