This week I’ve been chasing my own white whale. Another developer came to me with this little problem. A query he was trying to run over a linked server to an Oracle database was returning unexpected results. Here’s his original query:
select count(order_num) from openquery(PROD, 'select t2.order_num from myschema.mytable t2 where open_date < to_date(''01-JAN-2011'', ''DD-MON-YYYY'') and nature = ''057'' and dept_code = ''7240'' and status_code like ''O%''')a
The resulting count was 200. However, running that same query directly on Oracle returns 658. So I moved the count() function inside the remote query (which is a more efficient query anyway, but that’s beside the point).
select * from openquery(PROD, 'select count(order_num) from myschema.mytable t2 where open_date < to_date(''01-JAN-2011'', ''DD-MON-YYYY'') and nature = ''057'' and dept_code = ''7240'' and status_code like ''O%''')a
The result? 658. And running a simple select *, rather than count(*), also returns 658 records. A little poking around on Oracle’s Metalink site netted a few bugs with the Oracle OLEDB provider, including one specifically related to count(*). The bug description said it was 100% reproducible with no workaround. And normally I might let it go at that and move on. But the thing is, it’s not 100% reproducible. If I run that same original query against the same table in our QA environment, I get the correct count. In fact, if I run it against a different column in that same table in the Prod environment, I get the correct count. Running test queries against many columns in several tables in each environment yielded mixed results, with seemingly no rhyme or reason behind which columns/tables returned a correct count, and which returned a count of 200. Now, I realize that this is called a “bug” for a reason, but computers just don’t produce random and arbitrary results. So I determined to figure out why.
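For what it’s worth, the sanity check I kept repeating for each column boiled down to something like the sketch below. PROD and myschema.mytable come from the query above; I’ve trimmed the WHERE clause to a single predicate for readability, and the variable names are just for illustration.

declare @local_count int, @remote_count int;

-- Count computed locally by SQL Server over the rows streamed back by the provider
select @local_count = count(order_num)
from openquery(PROD, 'select order_num from myschema.mytable
                      where status_code like ''O%''') a;

-- Same count computed remotely by Oracle; only a single row crosses the link
select @remote_count = cnt
from openquery(PROD, 'select count(order_num) cnt from myschema.mytable
                      where status_code like ''O%''') a;

-- If the provider is silently dropping rows, these two numbers won't match
select @local_count as local_count, @remote_count as remote_count;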
I ran a 10046 event trace in Oracle along with a Profiler trace for both the “good” and “bad” scenarios. In the “good” test, SQL Server requests rows (OLEDB Call Event) in batches of 100:
<inputs>
<hReserved>0x0000000000000000</hReserved>
<lRowsOffset>0</lRowsOffset>
<cRows>100</cRows>
</inputs>
And you see an “OLEDB DataRead Event” for each row fetched:
<inputs>
<hRow>0x0000000000000259</hRow>
<hAccessor>0x000000001127D3E0</hAccessor>
</inputs>
For the last batch, it requests 100 records again, but only gets 4 back:
<hresult>265926</hresult>
<outputs>
<pcRowsObtained>4</pcRowsObtained>
<prghRows>
<HROW>0x000000000000025A</HROW>
<HROW>0x000000000000025B</HROW>
<HROW>0x000000000000025C</HROW>
<HROW>0x000000000000025D</HROW>
</prghRows>
</outputs>
That hresult of 265926 means “Reached start or end of rowset or chapter.” And that all corresponds with what we see
in the Oracle trace file. Each fetch is recorded with a line like this:
FETCH #3:c=0,e=2772,p=0,cr=73,cu=0,mis=0,r=100,dep=0,og=4,tim=18884103172270
where r=100 is the number of rows returned. The final fetch shows r=4. This is what we would expect to see.
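As an aside, I didn’t spell out how the Oracle trace was captured. Since the query runs in a session that the linked server opens on the Oracle side, one way to get a 10046 trace is to enable it on that session from another Oracle session via DBMS_MONITOR; the sid and serial# below are placeholders, and level-12-style detail (waits and binds) is optional.

-- Enable a 10046-style trace on the linked server's Oracle session.
-- 123 / 456 are placeholders for the session's sid and serial#.
exec dbms_monitor.session_trace_enable(session_id => 123, serial_num => 456,
                                       waits => true, binds => true);

-- ...reproduce the linked-server query from SQL Server...

exec dbms_monitor.session_trace_disable(session_id => 123, serial_num => 456);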
Now let’s look at the “bad” test. Here we see no DataRead events in the Profiler trace. The very last OLEDB
Call Event
<inputs>
<hReserved>0x0000000000000000</hReserved>
<lRowsOffset>0</lRowsOffset>
<cRows>100</cRows>
</inputs>
shows 99 records returned.
<hresult>265926</hresult>
<outputs>
<pcRowsObtained>99</pcRowsObtained>
<prghRows>
<HROW>0x0000000000000066</HROW>
[...]
<HROW>0x00000000000000C8</HROW>
</prghRows>
</outputs>
But what’s different here is that the Oracle trace shows 100 records being returned in that last fetch. So why did
SQL Server only receive 99 records? I thought perhaps an OLEDB trace would shed some light, but that’s like reading Sanskrit.
Further examination of the Profiler traces yielded one difference that seemed relevant. From the “good” test:
<inputs>
<cPropertyIDSets>1</cPropertyIDSets>
<rgPropertyIDSets>
<DBPROPIDSET>
<cPropertyIDs>1</cPropertyIDs>
<rgPropertyIDs>
<DBPROPID>DBPROP_ISequentialStream</DBPROPID>
</rgPropertyIDs>
<guidPropertySet>DBPROPSET_ROWSET</guidPropertySet>
</DBPROPIDSET>
</rgPropertyIDSets>
</inputs>
next line:
<hresult>0</hresult>
<outputs>
<pcPropertySets>1</pcPropertySets>
<prgPropertySets>
<DBPROPSET>
<cProperties>1</cProperties>
<guidPropertySet>DBPROPSET_ROWSET</guidPropertySet>
<rgProperties>
<DBPROP>
<dwPropertyID>DBPROP_ISequentialStream</dwPropertyID>
<dwOptions>0</dwOptions>
<dwStatus>0</dwStatus>
<colid>DB_NULLID</colid>
<vValue>
<VARIANT>
<vt>VT_BOOL</vt>
<boolVal>-1</boolVal>
</VARIANT>
</vValue>
</DBPROP>
</rgProperties>
</DBPROPSET>
</prgPropertySets>
</outputs>
next line:
<inputs>
<riid>IID_IRowsetInfo</riid>
</inputs>
next line:
<hresult>0</hresult>
That DBPROP_ISequentialStream. I’m thinking that’s the critical factor that makes the OLEDB provider handle the data differently in the “good” tests, and allows it to process the data correctly. The only problem is: I can’t prove it. Yet. So far the Great Gazoogle has provided me with little helpful information on this property and its impact. But the search continues, if in a slightly less obsessed manner. And hopefully things will work out better for me than they did for Ahab.
Incidentally, there does seem to be a workaround. If, when you create the linked server, you include a FetchSize in the provider string, e.g. @provstr=N'FetchSize=500', with the FetchSize set to 200 or more, this should get you the correct result. At least, it did for me.
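For example, a linked server created with the Oracle OLEDB provider and a FetchSize in the provider string might look something like this; the server name, data source, and product are placeholders for your own setup, and @provstr is the part that matters.

exec sp_addlinkedserver
     @server     = N'PROD',               -- linked server name (placeholder)
     @srvproduct = N'Oracle',
     @provider   = N'OraOLEDB.Oracle',    -- the Oracle OLEDB provider
     @datasrc    = N'PRODTNS',            -- TNS alias (placeholder)
     @provstr    = N'FetchSize=500';      -- the workaround: 200 or more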