2010年6月9日星期三

Eclair Libgralloc Deadlock Problem

As we are developing 0xdroid beagle-eclair branch, we occasionally encounter screen flipping issue. This issue rarely happens however it bothers the user experience very much when running some resource eating applications. Last week in the Computex Taipei 2010 show, we demoed 0xdroid beagle-eclair connecting wireless modules and played games. I noticed that this issue happens very often while playing a game called "Frozen Bubble". (It's a good game, we all love this game a lot, and spent a lot of time on it. ;-) ) It's kind of embarrassing when the screen keeps flipping on the show. Therefore I decided to dig out this issue.




Beside of 0xdroid on beagleboard and Devkit8000, I tested some other platforms and find out actually almost all of them having this problem. Therefore I suspected it's not a hardware related problem. Maybe a framework or HAL issue. We noticed that when the screen is flipping, the logcat message will complain as following

E/SurfaceFlinger(  768): eglSwapBuffers: EGL error 0x3002 (EGL_BAD_ACCESS)
E/gralloc (  768): handle 0x13f8c0 not locked
~E/gralloc (  768): handle 0x13f8c0 already locked for write
E/libagl  (  768): eglSwapBuffers() failed to lock buffer 0x1368e0 (640x480)
E/SurfaceFlinger(  768): eglSwapBuffers: EGL error 0x3002 (EGL_BAD_ACCESS)
E/gralloc (  768): handle 0x13f8c0 not locked
E/gralloc (  768): handle 0x13f8c0 already locked for write
E/libagl  (  768): eglSwapBuffers() failed to lock buffer 0x1368e0 (640x480)
E/SurfaceFlinger(  768): eglSwapBuffers: EGL error 0x3002 (EGL_BAD_ACCESS)
E/gralloc (  768): handle 0x13f8c0 not locked
E/gralloc (  768): handle 0x13f8c0 already locked for write
E/libagl  (  768): eglSwapBuffers() failed to lock buffer 0x1368e0 (640x480)
E/SurfaceFlinger(  768): eglSwapBuffers: EGL error 0x3002 (EGL_BAD_ACCESS)
E/gralloc (  768): handle 0x13f8c0 not locked

Therefore I checked the libgralloc and adding some debug message. The libgralloc plugin 0xdroid used is branched from original eclair source tree. I took few hours created the omap3/libgralloc at the first day when I got eclair source code months ago. Since it works well for the most of time, I didn't pay too much attention to it, until I found the deadlock issue goes crazy on frozen bubble.
After noticing the lock log and swap error, I took a close look of the gralloc_lock and gralloc_unlock in hardware/omap3/libgralloc/mapper.c

int gralloc_lock(gralloc_module_t const* module,
        buffer_handle_t handle, int usage,
        int l, int t, int w, int h,
        void** vaddr)
{
    if (private_handle_t::validate(handle) < 0)
        return -EINVAL;

    int err = 0;
    private_handle_t* hnd = (private_handle_t*)handle;
    int32_t current_value, new_value;
    int retry;

    do {
        current_value = hnd->lockState;
        new_value = current_value;

        if (current_value & private_handle_t::LOCK_STATE_WRITE) {
            // already locked for write 
            LOGE("handle %p already locked for write", handle);
            return -EBUSY;
        } else if (current_value & private_handle_t::LOCK_STATE_READ_MASK) {
            // already locked for read
            if (usage & (GRALLOC_USAGE_SW_WRITE_MASK | GRALLOC_USAGE_HW_RENDER)) {
                LOGE("handle %p already locked for read", handle);
                return -EBUSY;
            } else {
                // this is not an error
                //LOGD("%p already locked for read... count = %d", 
                //        handle, (current_value & ~(1<<31)));
            }
        }

        // not currently locked
        if (usage & (GRALLOC_USAGE_SW_WRITE_MASK | GRALLOC_USAGE_HW_RENDER)) {
            // locking for write
            new_value |= private_handle_t::LOCK_STATE_WRITE;
        }
        new_value++;

        retry = android_atomic_cmpxchg(current_value, new_value, 
    } while (retry);

    if (new_value & private_handle_t::LOCK_STATE_WRITE) {
        // locking for write, store the tid
        hnd->writeOwner = gettid();
    }

    if (usage & (GRALLOC_USAGE_SW_READ_MASK | GRALLOC_USAGE_SW_WRITE_MASK)) {
        if (!(current_value & private_handle_t::LOCK_STATE_MAPPED)) {
            // we need to map for real
            pthread_mutex_t* const lock = &sMapLock;
            pthread_mutex_lock(lock);
            if (!(hnd->lockState & private_handle_t::LOCK_STATE_MAPPED)) {
                err = gralloc_map(module, handle, vaddr);
                if (err == 0) {
                    android_atomic_or(private_handle_t::LOCK_STATE_MAPPED,
                            (volatile int32_t*)&(hnd->lockState));
                }
            }
            pthread_mutex_unlock(lock);
        }
        *vaddr = (void*)hnd->base;
    }

    return err;
}

int gralloc_unlock(gralloc_module_t const* module, 
        buffer_handle_t handle)
{
    if (private_handle_t::validate(handle) < 0)
        return -EINVAL;

    private_handle_t* hnd = (private_handle_t*)handle;
    int32_t current_value, new_value;

    do {
        current_value = hnd->lockState;
        new_value = current_value;

        if (current_value & private_handle_t::LOCK_STATE_WRITE) {
            // locked for write
            if (hnd->writeOwner == gettid()) {
                hnd->writeOwner = 0;
                new_value &= ~private_handle_t::LOCK_STATE_WRITE;
            }
        }

        if ((new_value & private_handle_t::LOCK_STATE_READ_MASK) == 0) {
            LOGE("handle %p not locked", handle);
            return -EINVAL;
        }

        new_value--;

    } while (android_atomic_cmpxchg(current_value, new_value, 
            (volatile int32_t*)&hnd->lockState));

    return 0;
}

The code looks reasonably for the first look. Lock and unlock pair looks good. However there is a very tricky part "android_atomic_cmpxchg may fail". Understanding this, it is not hard to see there is a potential bug in gralloc_unlock.  If android_atomic_cmpxchg fails, it will run the do while loop for more than once. However for the first run, the hnd->writeOwner will be changed to 0.  And then the new_value will not be changed anymore. This lock will goes crazy here after.

The patch solves this problem.

diff --git a/mapper.cpp b/mapper.cpp
index 16ebcc2..1f3e722 100644
--- a/mapper.cpp
+++ b/mapper.cpp
@@ -267,13 +267,13 @@ int gralloc_unlock(gralloc_module_t const* module,
         if (current_value & private_handle_t::LOCK_STATE_WRITE) {
             // locked for write
             if (hnd->writeOwner == gettid()) {
-                hnd->writeOwner = 0;
                 new_value &= ~private_handle_t::LOCK_STATE_WRITE;
             }
         }
 
         if ((new_value & private_handle_t::LOCK_STATE_READ_MASK) == 0) {
             LOGE("handle %p not locked", handle);
+            hnd->writeOwner = 0;
             return -EINVAL;
         }
 
@@ -282,5 +282,6 @@ int gralloc_unlock(gralloc_module_t const* module,
     } while (android_atomic_cmpxchg(current_value, new_value, 
             (volatile int32_t*)&hnd->lockState));
 
+    hnd->writeOwner = 0;
     return 0;
 }

It make sure the value hnd->writeOwner is the same as the first loop, if android_atomic_cmpxchg fails.

This issue comes from the original eclair source tree, and it is still there, and had been inherited to many different platforms.  If you encounter two frames crazily flipping and having the lock message, you may try to take a look of your libgralloc.