    res_pjsip_pubsub: Address SEGV when attempting to terminate a subscription · b57cd014
    George Joseph authored
    Occasionally under load we'll attempt to send a final NOTIFY on a
    subscription that's already been terminated and a SEGV will occur
    down in pjproject's evsub_destroy function.  This is a result of a
    race condition between all the paths that can generate a notify
    and/or destroy the underlying pjproject evsub object:
    
     * The client can send a SUBSCRIBE with Expires: 0.
     * The client can send a SUBSCRIBE/refresh.
     * The subscription timer can expire.
     * An extension state can change.
     * An MWI event can be generated.
     * The pjproject transaction timer (timer_b) can expire.
    
    Normally when our pubsub_on_evsub_state is called with a terminate,
    we push a task to the serializer and return, at which point the
    dialog is unlocked.  This is usually not a problem because the task
    runs immediately and locks the dialog again.  When the system is
    heavily loaded, though, there may be a delay between the unlock and
    relock during which another event may occur, such as the
    subscription timer or timer_b expiring, an extension state change,
    etc.  These may also cause a terminate to be processed and, if so,
    we could cause pjproject to try to destroy the evsub structure
    twice.  There's no way for us to tell that the evsub was already
    destroyed, and the evsub's group lock can't tolerate a second
    destroy, so it SEGVs.
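    
    The interleaving described above can be illustrated with a toy,
    single-threaded model.  This is only a sketch: every name in it
    (toy_sub, toy_serializer, toy_on_terminate_event, etc.) is
    hypothetical and none of it is Asterisk or pjproject code.  It
    assumes the serializer is just a FIFO that runs later, and shows
    that two terminate events arriving before the queue drains lead to
    two destroy attempts on the same subscription.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in Asterisk or pjproject. */
enum { TOY_MAX_TASKS = 8 };

struct toy_sub {
    int destroy_attempts;   /* how often a destroy was attempted */
};

typedef void (*toy_task_fn)(struct toy_sub *sub);

/* A stand-in for the serializer: a FIFO of tasks that run later. */
struct toy_serializer {
    toy_task_fn fn[TOY_MAX_TASKS];
    struct toy_sub *arg[TOY_MAX_TASKS];
    size_t count;
};

static void toy_push(struct toy_serializer *s, toy_task_fn fn,
                     struct toy_sub *sub)
{
    assert(s->count < TOY_MAX_TASKS);
    s->fn[s->count] = fn;
    s->arg[s->count] = sub;
    s->count++;
}

/* The queued task: with no guard, every run attempts a destroy. */
static void toy_terminate_task(struct toy_sub *sub)
{
    sub->destroy_attempts++;
}

/* Models an event handler that queues the work and returns, which is
 * the point where the dialog lock would be released. */
static void toy_on_terminate_event(struct toy_serializer *s,
                                   struct toy_sub *sub)
{
    toy_push(s, toy_terminate_task, sub);
}

static void toy_run_all(struct toy_serializer *s)
{
    for (size_t i = 0; i < s->count; i++) {
        s->fn[i](s->arg[i]);
    }
    s->count = 0;
}

/* Returns the number of destroy attempts made on one subscription. */
int toy_unguarded_destroy_attempts(void)
{
    struct toy_serializer s = { .count = 0 };
    struct toy_sub sub = { 0 };

    toy_on_terminate_event(&s, &sub);  /* e.g. SUBSCRIBE with Expires: 0 */
    /* The serializer is delayed here because the system is loaded. */
    toy_on_terminate_event(&s, &sub);  /* e.g. subscription timer expires */

    toy_run_all(&s);
    return sub.destroy_attempts;
}
```

    Calling toy_unguarded_destroy_attempts() yields two attempts for a
    single subscription, which is the double destroy that the real
    evsub cannot survive.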
    
    The remedy is twofold.
    
     * A patch has been submitted to Teluu and added to the bundled
       pjproject which adds operations to add and decrement references
       on evsub's group lock.
    
     * In res_pjsip_pubsub:
       * configure.ac and pjproject-bundled's configure.m4 were updated
         to check for the new evsub group lock APIs.
       * We now add a reference to the evsub group lock when we create
         the subscription and remove the reference when we clean up the
         subscription.  This prevents evsub from being destroyed before
         we're done with it.
       * A state has been added to the subscription tree structure so
         termination progress can be tracked through the asynchronous tasks.
       * The pubsub_on_evsub_state callback has been split so it's not doing
         double duty.  It now only handles the final cleanup of the
         subscription tree.  pubsub_on_rx_refresh now handles both client
         refreshes and client terminates.  It was always being called for
         both anyway.
       * The serialized_on_server_timeout task was removed since
         serialized_pubsub_on_rx_refresh was almost identical.
       * Missing state checks and ao2_cleanups were added.
       * Some debug levels were adjusted so that only off-nominal
         things appear at level 1 and nominal or progress things
         appear at level 2+.
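    
    The two core ideas of the remedy, holding our own reference and
    tracking termination state, can be sketched as follows.  This is a
    minimal illustration under stated assumptions, not the actual
    change: toy_evsub, toy_sub_tree, the TOY_SUB_* states and all
    helper functions are hypothetical stand-ins, and the plain integer
    counter only models what the evsub group lock reference does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the real Asterisk or pjproject types. */
enum toy_term_state {
    TOY_SUB_ACTIVE,
    TOY_SUB_TERMINATE_PENDING,
    TOY_SUB_TERMINATED
};

struct toy_evsub {
    int refs;       /* models the group lock reference count */
    int destroyed;  /* how many times the object was torn down */
};

struct toy_sub_tree {
    struct toy_evsub *evsub;
    enum toy_term_state state;
};

static void toy_evsub_add_ref(struct toy_evsub *e)
{
    e->refs++;
}

/* The object is torn down only when the last reference goes away. */
static void toy_evsub_dec_ref(struct toy_evsub *e)
{
    assert(e->refs > 0);
    if (--e->refs == 0) {
        e->destroyed++;
    }
}

/* Take our own reference when the subscription is created. */
static void toy_sub_tree_init(struct toy_sub_tree *t, struct toy_evsub *e)
{
    t->evsub = e;
    t->state = TOY_SUB_ACTIVE;
    toy_evsub_add_ref(e);
}

/* Returns true only for the first caller; later events are ignored. */
static bool toy_request_terminate(struct toy_sub_tree *t)
{
    if (t->state != TOY_SUB_ACTIVE) {
        return false;
    }
    t->state = TOY_SUB_TERMINATE_PENDING;
    return true;
}

/* Final cleanup releases our reference exactly once. */
static void toy_final_cleanup(struct toy_sub_tree *t)
{
    if (t->state == TOY_SUB_TERMINATED) {
        return;
    }
    t->state = TOY_SUB_TERMINATED;
    toy_evsub_dec_ref(t->evsub);
}

/* Returns how many times the evsub stand-in was destroyed. */
int toy_guarded_destroy_count(void)
{
    struct toy_evsub evsub = { .refs = 1, .destroyed = 0 }; /* library's ref */
    struct toy_sub_tree tree;

    toy_sub_tree_init(&tree, &evsub);            /* refs is now 2 */

    bool first = toy_request_terminate(&tree);   /* starts termination */
    bool second = toy_request_terminate(&tree);  /* racing event: ignored */
    assert(first && !second);

    toy_evsub_dec_ref(&evsub);                   /* library lets go first */
    assert(evsub.destroyed == 0);                /* our ref keeps it alive */

    toy_final_cleanup(&tree);                    /* last ref dropped */
    toy_final_cleanup(&tree);                    /* repeated call is a no-op */
    return evsub.destroyed;
}
```

    In this model the object is destroyed once, and only after the
    subscription tree has released its own reference, regardless of
    how many terminate events arrive in between.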
    
    ASTERISK-26099 #close
    Reported-by: Ross Beer.
    
    Change-Id: I779d11802cf672a51392e62a74a1216596075ba1