Tuesday, March 20, 2012

Query to Oracle via linked server hangs

Basic description:

We have developed a solution that sends data from SQL Server to an Oracle server as the result of a stored procedure called by a job that runs every minute. While this development worked fine in our test environment, after moving it to production it ran successfully the first minute, but the second minute the stored procedure hung, and the process could not be killed. In order to stop the process I had to stop both the SQL Agent and the MSDTC services.

Our SQL Server box:
SQL Server 2000 Standard Edition SP4
Windows 2003 Server R2 SP1

Our Oracle box:
Test: Oracle 9.2.0.6
Production: Oracle 9.2.0.4

To setup the SQL box, I did the following:
1) Install Oracle Client Tools version 10.2.0.1
2) Restart Server
3) Modify the registry as follows:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC\MTxOCI] "OracleXaLib"="oraclient10.dll" "OracleSqlLib"="orasql10.dll" "OracleOciLib"="oci.dll"
4) Modified the PATH variable so that all references to SQL Server appear in front of Oracle path references
5) Added the linked server via sp_addlinkedserver '<tns name>','Oracle','MSDAORA','<tns name>'
6) Added linked server logins via sp_addlinkedsrvlogin '<tns name>','False','<SQL user>','<Oracle User name>','<password on oracle>'
7) Changed the registry for MSDTC to match this:

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC]
"AllowOnlySecureRpcCalls"=dword:00000000
"FallbackToUnsecureRPCIfNecessary"=dword:00000001
"TurnOffRpcSecurity"=dword:00000001

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC\Security]
"NetworkDtcAccess"=dword:00000001
"NetworkDtcAccessAdmin"=dword:00000001
"NetworkDtcAccessClients"=dword:00000001
"NetworkDtcAccessTransactions"=dword:00000001
"NetworkDtcAccessTip"=dword:00000001
"XaTransactions"=dword:00000001
"DomainControllerState"=dword:00000000
"AccountName"="NT Authority\\NetworkService"
"NetworkDtcAccessOutbound"=dword:00000001
"NetworkDtcAccessInbound"=dword:00000001

8) Stopped and restarted services in the following order:
1) MSDTC Stop
2) SQL Server Stop
3) MSDTC Start
4) SQL Server Start

The stored procedure:
In a single transaction, the stored procedure compares a production table against a logging table. If a record exists in the production table that is not in the logging table, a record is inserted into logging table, and a record sent to Oracle via an INSERT INTO OPENQUERY('INSTANCE','SELECT Column1, column2, column3,... FROM SCHEMA.TABLE')
SELECT column1,column2,column FROM SQLTable

This stored procedure has worked just fine for us in test, to either the test or production Oracle boxes, but it now fails, and hangs, in production to either the test or production Oracle boxes.

Additionally, I can run the following query via Query Analyzer from our test box to both the test Oracle and production Oracle and it runs successfully (this is NOT used in our stored procedure code, but is presented here as an indication that I think there is something wrong with the settings on our production SQL box):

SELECT * FROM OPENQUERY('INSTANCE','SELECT * FROM SCHEMA.TABLE')

When I run this same query via Query Analyzer on our Production SQL box, to either the test Oracle or production Oracle, it hangs, and I have to kill the process, and restart the MSDTC service.
Other queries that hang are:
SELECT * FROM SERVER..SCHEMA.TABLE

Additionally, I noticed that when I used this method to kill the process I would see errors like the following in the Application Event Log on the SQL box:

The XA Transaction Manager attempted to perform recovery with the XA resource manager. The XA resource manager reported that recovery was unsuccessful. DSN = MTxOCI.Dll.

Since I figured this was an aborted transaction still residing in the MSDTC log file, I would stop the MSDTC service, delete the MSDTC log file, reset the MSDTC log, and then restart the MSDTC service in order to prevent this error from occurring.

Not ALL queries from the production SQL box to production and test Oracle boxes fail. I can get results returned for this query:

SELECT COLUMN1, COLUMN2 FROM SERVER..SCHEMA.TABLE

I've been scouring the internet for about a week now, and I've run out of ideas on what to check on the production SQL box. Any suggestions would be greatly appreciated.

Tim

Dismayed by the lack of comment on my problem, I did some additional research into what is happening on my machines. I setup the Sysinternals utility ProcMon to capture file access when I run the query, both on Production and on Test. The biggest different I noticed was that, on Production where the queries are hanging, there's ALOT of activity by DLLHOST.exe opening oracle dlls. No such activity occurs on Test. In fact on Test, DLLHOST.exe isn't running at all.

So, I did some searching on Oracle's metalink, and Note 333327.1 (which points to KB 833388) suggests that DLLHOST.exe indicates that I've somehow configured my Oracle Provider for OLEDB to run Out of Process, and suggests modifying the registry key for the Oracle provider, HKLM\SOFTWARE\Microsoft\MSSQLServer\Providers\OraOLEDB.Oracle\ and set AllowInProcess=1

A couple things about this puzzle me. First, the Providers folder in the registry on my Test box (again, where the queries work fine) does not contain any entries whatsoever. On Production, I've got a bunch of folders in the Providers key, but none for OraOLEDB.Oracle. And why would it even matter, since I've configured my linked server to use MSDAORA? The MSDAORA key contains AllowInProcess=1 on Production.

Am I looking down the wrong path? Any suggestions?
|||

Did some more work on this today, and answered some of my own questions, but not the most important one.

I was able to find my Providers key on Test. Since we have multiple SQL Server instances on Test, it was under an instance folder at HKLM\Software\Microsoft\Microsoft SQL Server\Instance Name\Providers\

I added keys for \OraOLEDB.Oracle\ to both Production and Test, and made sure to set AllowInProcess=1 in the registry. I restarted SQL Server on both Production and Test. The queries, and subsequently the development, both worked fine in Test. On Production, I ran one of my trouble queries, SELECT * FROM LinkServer..Schema.Table, against the test Oracle instance, and I got results!

But when I ran it a second time, the query hung.

I tried the same test against the production Oracle instance. The query ran successfully the first time, but now all subsequent running of the query just hangs.

Back to the drawing board.

|||Worked on this again yesterday. I coordinated with our Oracle guy to see if he could watch what was happening on his end. As we tested, I saw that my SELECT * queries actually could complete today! Sure, they took over a minute, when from test they took less than a second, but it's progress, of a sort.

I setup ProcMon again to catch file and registry key activity again on the Production SQL Server box that has the long running queries. I verified that the linked queries to Oracle are now running In Process (no longer using DLLHOST.exe). But, more interesting, I spotted a number of Buffer Overflows occurring. I thought I was on to something, but then I saw this blog entry:
http://blogs.technet.com/markrussinovich/archive/2005/05/17/buffer-overflows.aspx
So, it may be nothing, and I'm back to the drawing board again.
|||

A resolution! The network/infrastructure folks looked at the problem after I asked them to check out the NIC on the server with the Production SQL, and it turns out that the ports on the Cisco router that the server was plugged into was configured in such a manner that was resulting in lots of dropped packets and packet collisions. After changing the Cisco router setting for those ports, my queries run as expected from my SQL Server production box.

I learned alot during this problem, though....

No comments:

Post a Comment