We had to IPL one of our production lpars yesterday because of a job that drove our CSA up on that lpar to over 96%.
We got a console dump of the MASTER address space on that lpar (LSYS) and sent it to IBM. IBM said it was due to another OEM product called ASG Zebb.
When we sent the dump to ASG they said it was a problem with their ZEBB product's tape scratch exit (ZEBB44XT).
So to circumvent yesterday's CSA problem we disabled the Zebb exit and that seemed to lower and stabilize our CSA usage.
However, after we made that change we continued to monitor the CSA and found that now we were seeing more orphaned segments but this time in the ECSA.
We got another console dump of the master address space (LSYS lpar) and sent that to IBM too.
This time IBM said it looks like an OEM CA product is leaving the ECSA orphans. ...
Our question is whether this is CA1, and whether that is the system leaving these orphaned ECSA segments.
The MLEL table is not orphaned storage, even though the address space that did the storage obtain may no longer be present.
CA 1 keeps a small (108 bytes) chunk of ECSA for each tape unit we see.
Now, we have no idea what tape units might be on each system, so we dynamically add a tape-unit to this growing table each time we see a new one for the first time. But of course we don't want to perform a tiny getmain each time, so we have a cell-pool routine that gets storage in x'1000' byte chunks.
We then take small 108-byte chunks out of the 4K (x'1000'-byte) area until we need another 4K area.
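The cell-pool approach described above can be sketched roughly as follows. This is only an illustration in Python, not CA 1 code: the names CellPool and get_cell are hypothetical, the area size is the x'1000' bytes mentioned above, and the 48-byte header is an assumption taken from the dump description in this reply.

```python
AREA_SIZE = 0x1000   # bytes obtained per large storage request (x'1000')
HEADER_SIZE = 48     # header at the start of each area (assumed from the dump notes)
CELL_SIZE = 108      # one cell per tape unit


class CellPool:
    """Toy model of a cell pool: carve fixed-size cells out of larger areas."""

    def __init__(self):
        self.areas = 0             # number of large areas obtained so far
        self.used_in_current = 0   # cells already handed out from the newest area
        self.cells_per_area = (AREA_SIZE - HEADER_SIZE) // CELL_SIZE

    def get_cell(self):
        """Return (area_index, offset) of a new cell, obtaining a new area if needed."""
        if self.areas == 0 or self.used_in_current == self.cells_per_area:
            self.areas += 1            # stands in for one large storage obtain
            self.used_in_current = 0
        offset = HEADER_SIZE + self.used_in_current * CELL_SIZE
        self.used_in_current += 1
        return self.areas - 1, offset


pool = CellPool()
cells = [pool.get_cell() for _ in range(38)]
print(pool.cells_per_area)  # 37
print(cells[0])             # (0, 48): first cell sits right after the header
print(cells[37])            # (1, 48): the 38th request forces a second area
```

The point of the design is that only one request in every 37 results in a large storage obtain; the other 36 are satisfied from the area already held.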
In the dump provided as an example, the first MLEL entry is at location 1D296030, inside a 4K area obtained at 1D296000 (the first 48 bytes are a header area). We are currently on our 15th 4K area (1D10C000). Since each x'1000'-byte area contains 37 individual MLEL entries, that means somewhere between 519 and 555 tape units have been seen on this system.
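The arithmetic behind those figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the constant names are made up for illustration, and the upper bound assumes the 15th area could be completely full.

```python
AREA_SIZE = 0x1000     # x'1000' bytes per area
HEADER_SIZE = 48       # header at the front of each area
ENTRY_SIZE = 108       # one MLEL entry per tape unit
AREAS_IN_USE = 15      # areas on the chain in the example dump

# Entries that fit after the header in one area.
entries_per_area = (AREA_SIZE - HEADER_SIZE) // ENTRY_SIZE

# 14 full areas plus at least one entry in the 15th, up to 15 full areas.
min_devices = (AREAS_IN_USE - 1) * entries_per_area + 1
max_devices = AREAS_IN_USE * entries_per_area

# Offset of the first entry within its area, from the two dump addresses.
first_entry_offset = 0x1D296030 - 0x1D296000

# Total storage held by the chain under these assumptions.
total_bytes = AREAS_IN_USE * AREA_SIZE

print(entries_per_area)          # 37
print(min_devices, max_devices)  # 519 555
print(first_entry_offset)        # 48
print(total_bytes)               # 61440
```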
So the 60K of ECSA (15 areas of x'1000' bytes each) used for this MLEL chain is not orphaned. The 108-byte device-level area is not released when the device is varied offline.
So yes, once a new TAPE device is seen the first time, the 108-byte area is obtained and is not released.
Again, because CA 1 does not perform tape allocation/deallocation, we do not obtain the storage when a device is brought online or release it when the device is varied offline. Instead, we get control at OPEN, and if the device was never seen before we obtain a new 108-byte area from our cell-pool processing.
As to the question of who does the getmain: again, the logic of CA 1 is to be able to run without a centralized address space. This makes CA 1 more robust, because we do not have to worry about an address space going down and losing TAPE tracking events. It also means that if a simple IEBGENER was submitted by any user and it was allocated to a TAPE device that was never seen before, we would obtain a new 108-byte entry from our cell-pool processing. If the current 4K pool was full, that would then trigger us to get a new 4K pool and use the first 108-byte entry in it.
So any job that allocates tape devices could cause a new 4K pool to be obtained. It simply depends on whether the TAPE device was seen before and whether the current 4K pool is full.
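To make the "who does the getmain" point concrete, here is a small Python model of the first-seen-at-OPEN logic. It is a sketch only, not how CA 1 is implemented: TapeUnitTable, on_open, and on_vary_offline are hypothetical names, and 37 entries per pool is the figure from the dump analysis above.

```python
ENTRIES_PER_POOL = 37   # entries that fit in one pool, per the dump analysis


class TapeUnitTable:
    """Toy model: entries are added at OPEN time and never released."""

    def __init__(self):
        self.seen = {}    # device number -> index of its entry
        self.pools = 0    # number of pools obtained so far

    def on_open(self, job, device):
        """Runs under whichever job issued the OPEN; reports the storage work done."""
        if device in self.seen:
            return "known"                      # nothing obtained
        needs_pool = len(self.seen) % ENTRIES_PER_POOL == 0
        if needs_pool:
            self.pools += 1                     # this job ends up obtaining the pool
        self.seen[device] = len(self.seen)
        return "new pool" if needs_pool else "new entry"

    def on_vary_offline(self, device):
        """Deliberately a no-op: the device entry is kept."""
        return None


table = TapeUnitTable()
first_37 = [table.on_open("JOBA", 0x500 + i) for i in range(37)]
table.on_vary_offline(0x500)
again = table.on_open("IEBGENER", 0x500)    # device already seen
fresh = table.on_open("IEBGENER", 0x600)    # never seen, current pool is full
print(first_37[0], first_37[1], again, fresh, table.pools)
```

In this model the IEBGENER job is the one that obtains the second pool, simply because it happened to open the 38th distinct device; the vary offline in between changes nothing.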