My name is Or Gerson, and I am part of the DataInfra team at Outbrain.
DataInfra is responsible for a job scheduling system that lets teams define jobs that run queries on a Hadoop cluster hosting about 2 PB of data.
Sometimes these queries can be inefficient – resulting in high processing time and extra load on the cluster.
Therefore we define a timeout limit for query execution. The idea is to kill the query when the timeout occurs. However, things are not so simple…
Kill them! Close their resource! Let none escape!
We use Apache DBCP as our connection pool library and the Hive2 driver to submit queries to Hive: a pretty simple setup.
Apache DBCP provides connection pooling and is widely used with many relational databases.
Queries are submitted using Spring JdbcTemplate, but the underlying connections are managed by Apache DBCP.
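The setup looks roughly like this (a sketch, not our production code: the host, port, table, and pool settings are placeholders, assuming the commons-dbcp 1.x and Spring JdbcTemplate APIs):

```java
import javax.sql.DataSource;
import org.apache.commons.dbcp.BasicDataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class HiveClient {
    static DataSource hiveDataSource() {
        BasicDataSource ds = new BasicDataSource();          // DBCP-managed pool
        ds.setDriverClassName("org.apache.hive.jdbc.HiveDriver");
        ds.setUrl("jdbc:hive2://hive-server:10000/default"); // placeholder host/port
        ds.setMaxActive(10);                                 // bound pooled connections
        return ds;
    }

    public static void main(String[] args) {
        JdbcTemplate jdbc = new JdbcTemplate(hiveDataSource());
        Long rows = jdbc.queryForObject("SELECT COUNT(*) FROM some_table", Long.class);
        System.out.println(rows);
    }
}
```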
These queries have a specific timeout set by the calling thread.
When the timeout occurs, the calling thread shuts down the async executor and calls the close() method on the “javax.sql.DataSource” object.
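The timeout flow can be sketched like this (the Hive query is stubbed with a sleep, and names are illustrative, not our actual classes):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class QueryWithTimeout {
    // Stand-in for the real Hive query submitted through JdbcTemplate.
    static String runQuery() throws InterruptedException {
        Thread.sleep(60_000); // simulates a long-running query
        return "result";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> pending = executor.submit(QueryWithTimeout::runQuery);
        try {
            String result = pending.get(200, TimeUnit.MILLISECONDS); // per-query timeout
            System.out.println(result);
        } catch (TimeoutException e) {
            // Timeout: shut the executor down and close the DataSource.
            // With TEZ, this turned out NOT to be enough to kill the remote job.
            executor.shutdownNow();
            System.out.println("query timed out");
        }
    }
}
```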
This works well when using the MapReduce engine, but when using TEZ the job keeps running in the cluster long after the JVM is gone.
Seems like a common problem.
To my surprise, my online search found this to be a recurring issue, but without a good solution.
I started drilling down into the “org.apache.commons.dbcp.PoolableConnection” class to understand the problem better.
Shouldn’t closing the datasource be enough?
Under the DBCP and JDBC abstractions I found “org.apache.hive.jdbc.HiveStatement” which uses a thrift client to execute operations on Hive.
When a query was submitted using MapReduce and reached its timeout, the running job recognized that its handler had died and shut itself down, resulting in a “Query Cancelled” status on the thread waiting in HiveStatement.
TEZ, on the other hand, did not recognize that its handling connection had died, and ran to completion (even though the GC had long since cleared this object).
Moreover, I found out that closing the datasource (using its close() method) did not call the close() method on the connection objects.
Searching the APIs in “org.apache.commons.dbcp” revealed that it did not expose its connection pool or the objects borrowed from it, so I had no way of interacting with them.
First, I needed to verify that I could actually close the connection from the client side, without interacting directly with the resource manager (YARN) running the TEZ job.
Luckily, I found that the “HiveStatement” class behaves well: when its close() method is called, the thrift session closes and does indeed kill the TEZ job.
I decided to create a data source that will keep references to the connections being used, allowing it to close them.
Now our datasource keeps references to its connections, and all we need is to call the register() and unregister() methods defined by a simple “ConnectionRegister” interface.
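A sketch of the idea (class and method names here are illustrative; a real wrapper would also delegate the rest of the DataSource interface, omitted for brevity):

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.sql.DataSource;

interface ConnectionRegister {
    void register(Connection c);
    void unregister(Connection c);
}

/** Wraps a pooled DataSource and remembers every borrowed connection,
 *  so that on timeout we can close them all explicitly. */
class RegisteringDataSource implements ConnectionRegister {
    private final DataSource delegate;
    private final Set<Connection> open = ConcurrentHashMap.newKeySet();

    RegisteringDataSource(DataSource delegate) { this.delegate = delegate; }

    public Connection getConnection() throws SQLException {
        Connection raw = delegate.getConnection();
        register(raw);
        // Intercept close() so a normally finished query unregisters itself.
        return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[] { Connection.class },
                (proxy, method, args) -> {
                    if ("close".equals(method.getName())) {
                        unregister(raw);
                    }
                    return method.invoke(raw, args);
                });
    }

    public void register(Connection c)   { open.add(c); }
    public void unregister(Connection c) { open.remove(c); }

    /** Called on timeout: closing the connection reaches HiveStatement,
     *  whose thrift session close kills the remote TEZ job. */
    public void closeAll() {
        for (Connection c : open) {
            try { c.close(); } catch (SQLException ignored) { }
            unregister(c);
        }
    }
}
```

On timeout, the calling thread calls closeAll() instead of relying on closing the datasource alone.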
After implementing this solution, queries using TEZ die on timeout: we explicitly close the open connection, which delegates the close down to the thrift layer.