Loading huge SAS dataset through saspy #560
Replies: 2 comments 1 reply
-
Hey, thanks for adding this discussion! So, first, just to clarify terminology: EG is a client application, just like SASPy. It is not SAS. It connects to a SAS server (a Workspace server), same as SASPy does. So it's the Workspace server each client is connected to that processes SAS data sets. Now, to your question. saspy can transfer large data; there's no particular limit in any way. That being said, it is not the fastest data transfer out there, especially with big data. Some of that is on Pandas, as all of the data has to be in memory in Python, which can become a problem if you don't have plenty of memory for the Python process. But some of that is also on saspy, as the way I transfer data is by generating compatible SAS code to process/transform and stream the data over to Pandas in a format that lets Pandas build the DataFrame with the correct values and types, based upon the contents and formats of the data in SAS. This has to work with 10-year-old SAS in the field, so nothing has ever been added to SAS to help with this - there's no saspy client/server code in SAS that knows anything about saspy. Were there a proprietary interface with SASPy, it could be faster, but then most customers wouldn't be able to use it, since they are still running earlier versions of SAS in their deployments. I just tried to transfer a 12G data set to see how long it takes on my system, and it failed for me because I don't have enough memory on my machine to support the DataFrame.
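For reference, the basic flow being described is just a SASsession plus sd2df; here's a minimal sketch, assuming a placeholder config name, libref, and table name:

```python
# Minimal sketch: pull a SAS data set into a pandas DataFrame with saspy.
# 'default', 'work', and 'cars' are placeholders for your own config/libref/table.
import saspy

sas = saspy.SASsession(cfgname='default')   # connect to a Workspace server

# sd2df streams the SAS data set over and builds the DataFrame;
# the entire DataFrame has to fit in the Python process's memory.
df = sas.sd2df(table='cars', libref='work')

print(df.shape)
print(df.dtypes)

sas.endsas()
```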
So, here's a try w/ 4G. The transfer should be pretty linear, depending upon your networking, so multiplying what this takes by 3 would be about the time to transfer the 12G, if I could support it in Python. This took about 3 min, so say 9 or so for the 12G. And it was at about 9 min that the 12G attempt failed to get the memory, so that kinda matches up.
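If you want to reproduce that kind of measurement, a rough sketch of timing one transfer and extrapolating linearly (the libref/table names are placeholders) might look like this:

```python
# Rough sketch: time one transfer and scale linearly to estimate a bigger one.
# 'mylib' and 'big4g' are placeholder names for illustration only.
import time
import saspy

sas = saspy.SASsession(cfgname='default')

start = time.perf_counter()
df = sas.sd2df(table='big4g', libref='mylib')
elapsed = time.perf_counter() - start

print(f"Transferred {len(df)} rows in {elapsed / 60:.1f} minutes")
# e.g. if a 4G table takes ~3 minutes, a 12G table should take roughly 3x that,
# provided the machine has enough memory to hold the resulting DataFrame.

sas.endsas()
```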
One other thing that can vary a little is that SAS can be deployed in many different ways (local SAS install, Workspace server, Viya Compute server, ...), so I have different access methods to connect to these different deployments. Again, SAS knows nothing about saspy, so I have to do everything to support all of these. But the saspy methods work the same across all access methods, so any code you have works equivalently regardless of which kind of SAS deployment you point it at; you can point it at a different SAS and the code will work the same! The point of that is the performance may vary some depending upon how you're connected: local/remote, STDIO, SSH, IOM, HTTP. But the networking (if you're connecting remotely) can make more of a difference than the access method. The previous 4G run was using STDIO with a local deployment on Linux, which is the fastest case. Here's the same using IOM Remote for the connection, which will be slower because it has to write the data to disk first and then transfer that over the network via an IOM API. That took 5 min instead of 3, so about 15 min for the 12G in this test.
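To make the access-method point concrete, a hedged sketch of a sascfg_personal.py defining both a local STDIO connection and a remote IOM connection (all paths, hostnames, and ports are placeholders for whatever your site uses):

```python
# sascfg_personal.py sketch - two connection definitions, selected by name at
# SASsession time. Every path/host/port below is a placeholder.
SAS_config_names = ['stdio_local', 'iom_remote']

# Local Linux install over STDIO - the fastest case mentioned above.
stdio_local = {
    'saspath': '/opt/sasinside/SASHome/SASFoundation/9.4/bin/sas_u8'
}

# Remote Workspace server over IOM - data goes over the network via the IOM API.
iom_remote = {
    'java':     '/usr/bin/java',
    'iomhost':  'sasserver.example.com',
    'iomport':  8591,
    'encoding': 'utf-8'
}
```

The same saspy code then runs against either one; only the cfgname changes, e.g. saspy.SASsession(cfgname='iom_remote').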
Sorry this is so long, but the short of it is that it works; your performance will just vary depending upon your environment. If you can get timings for part of the data, you should be able to estimate the timings for the larger data. I'm happy to look at what you have and how you're connecting and see if there's anything I can do to help, or to help if you run into any issues. Tom
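One way to put the "time part of the data" suggestion into practice is to pull only the first N rows via the dsopts data set options and extrapolate; a sketch, with placeholder names and a placeholder total row count:

```python
# Sketch: time a subset of the table (via the 'obs' data set option) and
# extrapolate to the full table. 'prodlib'/'bigtable' and the row counts
# are placeholders.
import time
import saspy

sas = saspy.SASsession(cfgname='iom_remote')

sample_rows = 1_000_000
start = time.perf_counter()
sample = sas.sd2df(table='bigtable', libref='prodlib',
                   dsopts={'obs': sample_rows})
elapsed = time.perf_counter() - start

total_rows = 50_000_000   # placeholder: the known row count of the full table
estimate = elapsed * total_rows / len(sample)
print(f"Sample took {elapsed:.0f}s; full table estimated at ~{estimate / 60:.0f} min")

sas.endsas()
```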
-
Sure thing! That's the IOM Remote access method. Let me know what you find with some different data on your systems.
-
Hi Tom,
I am a Python developer and recently joined a project that works with SAS. We need to create a Python application and do some analysis of SAS data in Python. For this we used saspy's sd2df function, were able to load SAS data from SAS Enterprise Guide, and were able to do the rest of the processing through pandas successfully in the test environment. But the problem now is that we have certain tables in the production environment which are huge, around 10GB. Since we are still in the testing phase, we don't have huge data; in the test environment it's just 27MB, and we were able to read that data successfully. So the question is: can saspy's sd2df read a 10GB dataset from SAS Enterprise Guide, or is there a performance issue? Some of our colleagues say it will hang and suggest using PySpark as an alternative. I would appreciate your suggestion on this.
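For context, sd2df accepts SAS data set options through its dsopts parameter, so only the rows and columns actually needed have to be transferred; a rough sketch of that, using placeholder library, table, and column names:

```python
# Sketch: transfer only the columns and rows actually needed from a large table.
# 'prodlib', 'bigtable', and the column names are placeholders.
import saspy

sas = saspy.SASsession(cfgname='default')

df = sas.sd2df(
    table='bigtable',
    libref='prodlib',
    dsopts={
        'keep':  ['custid', 'txn_date', 'amount'],   # columns to keep
        'where': "txn_date >= '01JAN2023'd",         # row filter applied in SAS
    },
)

print(df.shape)
sas.endsas()
```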