Description
Describe the bug
Hi, my company runs tasks in a cloud cluster and uses the AWS SDK TransferManager for file upload and download. We want to upgrade our AWS SDK from 1.7.266 to 1.11.368 to get new features, but we are running into an issue with a single job that works with a lot of data in parallel: some of its tasks occasionally deadlock. Each task spawns a process that performs an upload, calls ShutdownAPI right after, and then quits. We see about 6 deadlocks in 12000 upload process spawns. Going by the comment on WaitUntilAllFinished in TransferManager, this should work, and the code already waits for each handle to finish individually.
I am currently testing on 1.11.571 and the same issue persists, but I do not have backtraces for that version yet. This is a high priority for me, so I will follow up with more information ASAP. I am working on reproducing the issue reliably, but that is not easy: only the cloud job deadlocks reliably, and only after ~50 min and a lot of resources.
1.11.368 backtrace 1:
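ShutdownAPI calls TerminateAllComponents, which acquires s_registryMutex and then calls ~DefaultExecutor, which waits for the default executor threads to join. At the same time, another thread is destroying an S3Client for some reason and is stuck trying to acquire s_registryMutex. The thread being joined in Thread 1 (threadid=140497574162432, which is 0x7fc823fff000) is Thread 4, the one waiting on the mutex, so neither thread can make progress.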
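To make the suspected cycle concrete, here is a minimal standalone sketch of the pattern as I read it from the backtrace. It contains no SDK code: `g_registry_mutex`, `DeregisterWithTimeout` and `WorkerCouldDeregisterDuringShutdown` are hypothetical names, and a `std::timed_mutex` replaces the real mutex only so that the demo reports the blocked acquisition instead of hanging.

```cpp
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the registry mutex; a timed_mutex is used only so
// that this demo reports the blocked acquisition instead of hanging.
static std::timed_mutex g_registry_mutex;

// Stand-in for a client destructor that deregisters itself from the registry.
// Returns false if the lock could not be acquired within the timeout.
bool DeregisterWithTimeout(std::chrono::milliseconds timeout) {
    std::unique_lock<std::timed_mutex> lock(g_registry_mutex, timeout);
    return lock.owns_lock();
}

// The shutdown thread takes the registry lock and then joins a worker whose
// final cleanup step needs the same lock.
bool WorkerCouldDeregisterDuringShutdown() {
    bool worker_got_lock = false;
    std::unique_lock<std::timed_mutex> shutdown_lock(g_registry_mutex);
    std::thread worker([&worker_got_lock] {
        worker_got_lock = DeregisterWithTimeout(std::chrono::milliseconds(200));
    });
    worker.join();  // with a plain std::mutex, this join would never return
    return worker_got_lock;
}
```

Calling `WorkerCouldDeregisterDuringShutdown()` from `main` returns false: the worker cannot take the lock while the joining thread holds it, which is the point where the real process would hang.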
(gdb) thread apply all bt
Thread 4 (Thread 0x7fc823fff000 (LWP 28519)):
#0 __lll_lock_wait (futex=futex@entry=0x20f2248 <Aws::Utils::ComponentRegistry::s_registryMutex>, private=0) at lowlevellock.c:52
#1 0x00007fc888bab0a3 in __GI___pthread_mutex_lock (mutex=0x20f2248 <Aws::Utils::ComponentRegistry::s_registryMutex>) at ../nptl/pthread_mutex_lock.c:80
#2 0x0000000001c59f41 in __gthread_mutex_lock () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/x86_64-unknown-linux-gnu/bits/gthr-default.h:749
#3 lock () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_mutex.h:100
#4 lock () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/unique_lock.h:138
#5 unique_lock () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/unique_lock.h:68
#6 DeRegisterComponent () at external/aws_sdk/src/aws-cpp-sdk-core/source/utils/component-registry/ComponentRegistry.cpp:61
#7 0x0000000001a1fa83 in ~ClientWithAsyncTemplateMethods () at external/aws_sdk/src/aws-cpp-sdk-core/include/aws/core/client/AWSClientAsyncCRTP.h:79
#8 ~S3Client () at external/aws_sdk/generated/src/aws-cpp-sdk-s3/source/S3Client.cpp:340
#9 0x000000000179bddc in _M_release () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:158
#10 ~__shared_count () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:736
#11 ~__shared_ptr () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:1188
#12 ~TransferManagerConfiguration () at external/aws_sdk/src/aws-cpp-sdk-transfer/include/aws/transfer/TransferManager.h:40
#13 0x0000000001a14aa5 in ~TransferManager () at external/aws_sdk/src/aws-cpp-sdk-transfer/source/transfer/TransferManager.cpp:121
#14 0x0000000001a0df9c in _M_release () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:158
#15 ~__shared_count () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:736
#16 ~__shared_ptr () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:1188
#17 ~(void) () at external/aws_sdk/src/aws-cpp-sdk-transfer/source/transfer/TransferManager.cpp:1007
#18 0x0000000001a1a560 in _M_destroy () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:176
#19 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:200
#20 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:283
#21 0x00000000017a0df2 in ~_Function_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:245
#22 ~AmazonWebServiceRequest () at external/aws_sdk/src/aws-cpp-sdk-core/include/aws/core/AmazonWebServiceRequest.h:47
#23 0x0000000001b023de in ~ () at external/aws_sdk/generated/src/aws-cpp-sdk-s3/source/S3Client.cpp:2270
#24 ~_Bind () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/functional:401
#25 _M_destroy () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:176
#26 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:200
#27 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:283
#28 0x0000000001c6caf8 in ~_Function_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:245
#29 ~_Head_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/tuple:130
#30 ~_Bind () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/functional:401
#31 _M_destroy () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:176
#32 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:200
#33 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:283
#34 0x0000000001c6cbb2 in ~_Function_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:245
#35 ~_Head_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/tuple:130
#36 ~_Invoker () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/thread:250
#37 ~_State_impl () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/thread:205
#38 ~_State_impl () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/thread:205
#39 0x00007fc888a8cdfe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#40 0x00007fc888ba8609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#41 0x00007fc8888c8353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 3 (Thread 0x7fc846ffd000 (LWP 28469)):
#0 0x00007fc8888c868e in epoll_wait (epfd=6, events=0x7fc846fc72b0, maxevents=100, timeout=100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x0000000001cd5f29 in aws_event_loop_thread () at external/aws-c-io/source/linux/epoll_event_loop.c:614
#2 0x0000000001d6695a in thread_fn () at external/aws-c-common/source/posix/thread.c:177
#3 0x00007fc888ba8609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4 0x00007fc8888c8353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 2 (Thread 0x7fc84e97b000 (LWP 28468)):
#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fc8480040f4) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fc8480040a0, cond=0x7fc8480040c8) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x7fc8480040c8, mutex=0x7fc8480040a0) at pthread_cond_wait.c:647
#3 0x00007fc888a86e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x0000000001c67e7a in wait<(lambda at external/aws_sdk/src/aws-cpp-sdk-core/source/utils/logging/DefaultLogSystem.cpp:36:46)> () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/condition_variable:111
#5 LogThread () at external/aws_sdk/src/aws-cpp-sdk-core/source/utils/logging/DefaultLogSystem.cpp:36
#6 0x0000000001c6946d in __invoke_impl<void, void (*)(Aws::Utils::Logging::DefaultLogSystem::LogSynchronizationData*, std::shared_ptr<std::basic_ostream<char> >, std::basic_string<char> const&, bool), Aws::Utils::Logging::DefaultLogSystem::LogSynchronizationData*, std::shared_ptr<std::basic_ofstream<char> >, std::basic_string<char>, bool> () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/invoke.h:60
#7 __invoke<void (*)(Aws::Utils::Logging::DefaultLogSystem::LogSynchronizationData*, std::shared_ptr<std::basic_ostream<char> >, std::basic_string<char> const&, bool), Aws::Utils::Logging::DefaultLogSystem::LogSynchronizationData*, std::shared_ptr<std::basic_ofstream<char> >, std::basic_string<char>, bool> () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/invoke.h:95
#8 0x00007fc888a8cdf4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007fc888ba8609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007fc8888c8353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7fc84ec5d000 (LWP 21299)):
#0 __pthread_clockjoin_ex (threadid=140497574162432, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1 0x00007fc888a8d057 in std::thread::join() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x0000000001c6c8a0 in ~DefaultExecutor () at external/aws_sdk/src/aws-cpp-sdk-core/source/utils/threading/DefaultExecutor.cpp:79
#3 0x0000000001a1fe3c in _M_release () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:158
#4 ~__shared_count () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:736
#5 ~__shared_ptr () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:1188
#6 reset () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:1306
#7 ShutdownSdkClient () at external/aws_sdk/src/aws-cpp-sdk-core/include/aws/core/client/AWSClientAsyncCRTP.h:112
#8 0x0000000001c5a21e in TerminateAllComponents () at external/aws_sdk/src/aws-cpp-sdk-core/source/utils/component-registry/ComponentRegistry.cpp:91
#9 0x0000000001bc23c2 in ShutdownAPI () at external/aws_sdk/src/aws-cpp-sdk-core/source/Aws.cpp:206
#10 0x0000000000f5f490 in main () at job.cc:133
I have also found a second issue, but I cannot reason about it as easily as the first one. The process is deadlocked, but heap corruption also seems to be involved. My hunch is that objects belonging to the TransferManager, such as lambdas that I assumed had already been freed, are still alive after the transfers finish.
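The lifetime hunch can be illustrated without the SDK. The sketch below is only an assumption about the mechanism, not the SDK's actual code: `Manager` is a hypothetical stand-in for TransferManager, and the worker lambda stands in for an async task that captured a `shared_ptr` to it. It shows that setting the owner's pointer to `nullptr` does not destroy the object while a task still holds a copy, and that the destructor then runs on the worker thread.

```cpp
#include <future>
#include <memory>
#include <thread>

// Hypothetical stand-in for TransferManager: records which thread destroys it.
struct Manager {
    explicit Manager(std::thread::id* destroyed_on) : destroyed_on_(destroyed_on) {}
    ~Manager() { *destroyed_on_ = std::this_thread::get_id(); }
    std::thread::id* destroyed_on_;
};

// Returns true if the Manager outlived the owner's reset and was destroyed on
// the worker thread instead of the calling thread.
bool DestroyedOnWorkerThread() {
    std::thread::id destroyed_on;  // default value means "not destroyed yet"
    auto manager = std::make_shared<Manager>(&destroyed_on);
    std::promise<void> owner_released;
    std::future<void> released = owner_released.get_future();

    // The task holds its own copy of the shared_ptr, like a queued callback.
    std::thread worker([task_copy = manager, &released]() mutable {
        released.wait();    // the owner has dropped its reference by now
        task_copy.reset();  // last reference: ~Manager runs on this thread
    });
    const std::thread::id worker_id = worker.get_id();

    manager = nullptr;  // does not destroy the Manager; the task still owns it
    const bool still_alive = (destroyed_on == std::thread::id());
    owner_released.set_value();
    worker.join();
    return still_alive && destroyed_on == worker_id;
}
```

If the same thing happens in the SDK, `transfer_manager = nullptr` in my code would not guarantee that ~TransferManager has run before ShutdownAPI is called.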
1.11.368 backtrace 2:
(gdb) thread apply all bt
Thread 3 (Thread 0x7f8478ff9000 (LWP 143799)):
#0 __lll_lock_wait_private (futex=futex@entry=0x7f84bc7d0b80 <main_arena>) at ./lowlevellock.c:35
#1 0x00007f84bc67b09a in _int_free (av=0x7f84bc7d0b80 <main_arena>, p=0x47d182d0, have_lock=<optimized out>) at malloc.c:4302
#2 0x0000000001a1bc95 in _M_release () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:174
#3 ~__shared_count () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:736
#4 ~__shared_ptr () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:1188
#5 ~AWSClient () at external/aws_sdk/src/aws-cpp-sdk-core/include/aws/core/client/AWSClient.h:97
#6 0x000000000179bddc in _M_release () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:158
#7 ~__shared_count () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:736
#8 ~__shared_ptr () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:1188
#9 ~TransferManagerConfiguration () at external/aws_sdk/src/aws-cpp-sdk-transfer/include/aws/transfer/TransferManager.h:40
#10 0x0000000001a14aa5 in ~TransferManager () at external/aws_sdk/src/aws-cpp-sdk-transfer/source/transfer/TransferManager.cpp:121
#11 0x0000000001a04f0c in _M_release () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:158
#12 ~__shared_count () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:736
#13 ~__shared_ptr () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:1188
#14 ~(void) () at external/aws_sdk/src/aws-cpp-sdk-transfer/source/transfer/TransferManager.cpp:568
#15 0x0000000001a19250 in _M_destroy () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:176
#16 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:200
#17 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:283
#18 0x00000000017a0dd8 in ~_Function_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:245
#19 ~AmazonWebServiceRequest () at external/aws_sdk/src/aws-cpp-sdk-core/include/aws/core/AmazonWebServiceRequest.h:47
#20 0x0000000001b2d78e in ~ () at external/aws_sdk/generated/src/aws-cpp-sdk-s3/source/S3Client.cpp:4017
#21 ~_Bind () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/functional:401
#22 _M_destroy () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:176
#23 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:200
#24 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:283
#25 0x0000000001c6caf8 in ~_Function_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:245
#26 ~_Head_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/tuple:130
#27 ~_Bind () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/functional:401
#28 _M_destroy () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:176
#29 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:200
#30 _M_manager () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:283
#31 0x0000000001c6cbb2 in ~_Function_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/std_function.h:245
#32 ~_Head_base () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/tuple:130
#33 ~_Invoker () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/thread:250
#34 ~_State_impl () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/thread:205
#35 ~_State_impl () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/thread:205
#36 0x00007f84bc8c7dfe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#37 0x00007f84bc9e3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#38 0x00007f84bc703353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 2 (Thread 0x7f84827b6000 (LWP 143787)):
#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x47af67e0) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x47af6790, cond=0x47af67b8) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x47af67b8, mutex=0x47af6790) at pthread_cond_wait.c:647
#3 0x00007f84bc8c1e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x0000000001c67e7a in wait<(lambda at external/aws_sdk/src/aws-cpp-sdk-core/source/utils/logging/DefaultLogSystem.cpp:36:46)> () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/condition_variable:111
#5 LogThread () at external/aws_sdk/src/aws-cpp-sdk-core/source/utils/logging/DefaultLogSystem.cpp:36
#6 0x0000000001c6946d in __invoke_impl<void, void (*)(Aws::Utils::Logging::DefaultLogSystem::LogSynchronizationData*, std::shared_ptr<std::basic_ostream<char> >, std::basic_string<char> const&, bool), Aws::Utils::Logging::DefaultLogSystem::LogSynchronizationData*, std::shared_ptr<std::basic_ofstream<char> >, std::basic_string<char>, bool> () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/invoke.h:60
#7 __invoke<void (*)(Aws::Utils::Logging::DefaultLogSystem::LogSynchronizationData*, std::shared_ptr<std::basic_ostream<char> >, std::basic_string<char> const&, bool), Aws::Utils::Logging::DefaultLogSystem::LogSynchronizationData*, std::shared_ptr<std::basic_ofstream<char> >, std::basic_string<char>, bool> () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/invoke.h:95
#8 0x00007f84bc8c7df4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007f84bc9e3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007f84bc703353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7f8482a98000 (LWP 128384)):
#0 __lll_lock_wait_private (futex=futex@entry=0x7f84bc7d0b80 <main_arena>) at ./lowlevellock.c:35
#1 0x00007f84bc67e32b in __GI___libc_malloc (bytes=2497) at malloc.c:3064
#2 0x00007f84bc89bb29 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x0000000000f648ba in allocate () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/ext/new_allocator.h:121
#4 allocate () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/allocator.h:181
#5 allocate () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/alloc_traits.h:460
#6 _M_create () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/basic_string.tcc:153
#7 reserve () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/basic_string.tcc:293
#8 0x00000000018ff1eb in to_string () at external/boost/libs/stacktrace/include/boost/stacktrace/detail/frame_unwind.ipp:65
#9 0x00000000018f674d in to_string<std::allocator<boost::stacktrace::frame> > () at external/boost/libs/stacktrace/include/boost/stacktrace/stacktrace.hpp:400
#10 operator<<<char, std::char_traits<char>, std::allocator<boost::stacktrace::frame> > () at external/boost/libs/stacktrace/include/boost/stacktrace/stacktrace.hpp:406
#11 shl_input_streamable<const boost::stacktrace::basic_stacktrace<> >(void) () at external/boost/libs/lexical_cast/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:243
#12 0x00000000018f6458 in operator<<<boost::stacktrace::basic_stacktrace<> > () at external/boost/libs/lexical_cast/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:478
#13 try_convert () at external/boost/libs/lexical_cast/include/boost/lexical_cast/detail/converter_lexical.hpp:487
#14 0x00000000018f5ef2 in try_lexical_convert<std::basic_string<char>, boost::stacktrace::basic_stacktrace<> > () at external/boost/libs/lexical_cast/include/boost/lexical_cast/try_lexical_convert.hpp:201
#15 lexical_cast<std::basic_string<char>, boost::stacktrace::basic_stacktrace<> > () at external/boost/libs/lexical_cast/include/boost/lexical_cast.hpp:42
#16 captureBt () at common/complete_bt.cc:60
#17 0x00000000018f4118 in get () at common/complete_bt.cc:94
#18 0x00000000017a5df8 in programErr () at common/command_line_flags.cc:42
#19 <signal handler called>
#20 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#21 0x00007f84bc606859 in __GI_abort () at abort.c:79
#22 0x00007f84bc671266 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f84bc79b298 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#23 0x00007f84bc6792fc in malloc_printerr (str=str@entry=0x7f84bc79946a "corrupted double-linked list") at malloc.c:5347
#24 0x00007f84bc67994c in unlink_chunk (p=p@entry=0x4daffc50, av=0x7f84bc7d0b80 <main_arena>) at malloc.c:1460
#25 0x00007f84bc67ae8b in _int_free (av=0x7f84bc7d0b80 <main_arena>, p=0x4daffb90, have_lock=<optimized out>) at malloc.c:4342
#26 0x0000000001ecb55b in EVP_PKEY_free_it ()
#27 0x0000000001ecc290 in EVP_PKEY_free ()
#28 0x0000000001f5b6fc in pubkey_cb ()
#29 0x0000000001dface5 in asn1_item_embed_free ()
#30 0x0000000001dfaff8 in asn1_template_free ()
#31 0x0000000001dfacbf in asn1_item_embed_free ()
#32 0x0000000001dfaff8 in asn1_template_free ()
#33 0x0000000001dfacbf in asn1_item_embed_free ()
#34 0x0000000001dfaf19 in ASN1_item_free ()
#35 0x0000000001f4ebdd in X509_OBJECT_free ()
#36 0x0000000001f420b0 in OPENSSL_sk_pop_free ()
#37 0x0000000001f4f1b5 in X509_STORE_free ()
#38 0x0000000001d3bd01 in s2n_x509_trust_store_wipe () at external/s2n-tls/tls/s2n_x509_validator.c:136
#39 0x0000000001d03ec0 in s2n_config_cleanup () at external/s2n-tls/tls/s2n_config.c:120
#40 0x0000000001d0438a in s2n_config_free () at external/s2n-tls/tls/s2n_config.c:397
#41 0x0000000001cde237 in s_s2n_ctx_destroy () at external/aws-c-io/source/s2n/s2n_tls_channel_handler.c:1402
#42 0x0000000001d66d8d in aws_ref_count_release () at external/aws-c-common/source/ref_count.c:29
#43 0x0000000001ce17c6 in aws_tls_ctx_release () at external/aws-c-io/source/tls_channel_handler.c:790
#44 aws_tls_connection_options_clean_up () at external/aws-c-io/source/tls_channel_handler.c:597
#45 0x0000000001c877cf in Aws::Crt::Io::TlsConnectionOptions::~TlsConnectionOptions() () at external/aws-crt-cpp/source/io/TlsOptions.cpp:283
#46 0x0000000001bc2da0 in _M_release () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:158
#47 operator= () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:755
#48 operator= () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr_base.h:1187
#49 operator= () at external/clang_12_x86_64/x86_64-unknown-linux-gnu/usr/lib/gcc/x86_64-unknown-linux-gnu/10.5.0/../../../../include/c++/10.5.0/bits/shared_ptr.h:358
#50 SetDefaultTlsConnectionOptions () at external/aws_sdk/src/aws-cpp-sdk-core/source/Globals.cpp:40
#51 CleanupCrt () at external/aws_sdk/src/aws-cpp-sdk-core/source/Globals.cpp:60
#52 0x0000000001bc23f4 in ShutdownAPI () at external/aws_sdk/src/aws-cpp-sdk-core/source/Aws.cpp:218
#53 0x0000000000f5f490 in main () at job.cc:133
Regression Issue
- Select this option if this issue appears to be a regression.
Expected Behavior
Clean shutdown. No deadlock.
Current Behavior
Deadlock
Reproduction Steps
Aws::SDKOptions options;
Aws::InitAPI(options);
{
    // stats is a custom class that tracks the progress of the upload
    UploadStats stats;
    Aws::String bucket_name = "my-bucket";
    Aws::String s3_file_path = "s3file.txt";
    Aws::String local_file_path = "localfile.txt";
    Aws::Client::ClientConfiguration client_configuration;
    client_configuration.region = "us-west-2";
    auto s3_client = Aws::MakeShared<Aws::S3::S3Client>("default", client_configuration);
    auto executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>("executor", 8);
    Aws::Transfer::TransferManagerConfiguration transfer_config(executor.get());
    transfer_config.s3Client = s3_client;
    transfer_config.uploadProgressCallback = [&stats](const Aws::Transfer::TransferManager* manager, const std::shared_ptr<const Aws::Transfer::TransferHandle>& handle) {
        stats.update(*handle);
    };
    auto transfer_manager = Aws::Transfer::TransferManager::Create(transfer_config);

    // Create the upload file requests.
    // paths is a container of S3FilePath defined elsewhere; this simplified
    // repro uploads the same file once per entry.
    std::vector<std::shared_ptr<Aws::Transfer::TransferHandle>> transfer_handles;
    for (const S3FilePath& file_path : paths) {
        transfer_handles.push_back(transfer_manager->UploadFile(
            local_file_path.c_str(), bucket_name.c_str(), s3_file_path.c_str(),
            "binary", Aws::Map<Aws::String, Aws::String>()));
    }

    // Wait for each transfer to complete and check the status of the transfer.
    std::stringstream ss;
    int num_failed = 0;
    for (const auto& handle : transfer_handles) {
        handle->WaitUntilFinished();
        if (handle->GetStatus() != Aws::Transfer::TransferStatus::COMPLETED) {
            ss << handle->GetLastError() << "\n";
            ++num_failed;
        }
    }

    // Block the calling thread until all operations of this TransferManager instance have finished.
    transfer_manager->WaitUntilAllFinished();
    // Wait until everything has stopped.
    executor->WaitUntilStopped();

    // Clean up all AWS objects.
    // Do NOT change the order of these operations.
    transfer_handles.clear();
    transfer_manager = nullptr;
    executor = nullptr;
    s3_client = nullptr;

    // Report the number of failed transfers.
    std::cout << "Number of failed transfers: " << num_failed << std::endl;
}
Aws::ShutdownAPI(options);
Possible Solution
Address the issues with TransferManager object lifetime?
Should the default executor wait on its own threads without s_registryMutex being held? That would mean a change to TerminateAllComponents.
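One way this could look is sketched below, under the assumption that the registry is essentially a locked list of things to terminate. I have not checked how TerminateAllComponents is really structured, so `Registry`, `Register`, `Touch` and `TerminateAll` are hypothetical names. The idea is to move the pending entries out while holding the lock, release it, and only then run the terminators that may join threads.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical registry; not the SDK's ComponentRegistry.
class Registry {
public:
    void Register(std::function<void()> terminate) {
        std::lock_guard<std::mutex> lock(mutex_);
        terminators_.push_back(std::move(terminate));
    }

    // Stand-in for a deregistration call: it needs the registry lock.
    void Touch() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++touches_;
    }

    // Moves the entries out under the lock, then runs them with the lock released.
    std::size_t TerminateAll() {
        std::vector<std::function<void()>> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending.swap(terminators_);
        }
        for (auto& terminate : pending) {
            terminate();  // may join threads that call Touch()
        }
        return pending.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::function<void()>> terminators_;
    int touches_ = 0;
};

// Each terminator joins a thread that needs the registry lock, mirroring an
// executor join whose worker deregisters a client.
int TerminatorsThatTouchedRegistry() {
    Registry registry;
    int touched = 0;
    for (int i = 0; i < 3; ++i) {
        registry.Register([&registry, &touched] {
            std::thread worker([&registry, &touched] {
                registry.Touch();
                ++touched;
            });
            worker.join();
        });
    }
    registry.TerminateAll();
    return touched;
}
```

In this sketch `TerminatorsThatTouchedRegistry()` completes and returns 3, because no join happens while the registry lock is held. Whether a change like this is safe for the real registry (for example with components registering during shutdown) is something the maintainers would have to judge.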
In one path, TransferManager::DoDownload calls m_transferConfig.s3Client->GetObjectAsync. This function submits a new lambda to the S3Client's m_executor, which is the default executor? Maybe WaitUntilFinished in TransferManager does not account for this different executor.
Both crashes involve PutObjectAsync and GetObjectAsync:
#23 0x0000000001b023de in ~ () at external/aws_sdk/generated/src/aws-cpp-sdk-s3/source/S3Client.cpp:2270
#20 0x0000000001b2d78e in ~ () at external/aws_sdk/generated/src/aws-cpp-sdk-s3/source/S3Client.cpp:4017
Additional Information/Context
No response
AWS CPP SDK version used
1.11.368
Compiler and Version used
clang
Operating System and version
Ubuntu 20.04