Hi,
it appears that on Windows, thread-local storage (TLS) static data members of class templates are not initialized properly in worker threads.
Here is a (maybe not so minimal) test case that reproduces the problem:
--------------------------------------------------------------------------------------------------------------------------------------------------
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <omp.h>

#define NTHREADS 2
#define THREAD thread_local

class MyClass
{
public:
    MyClass(int n): m_n(n)
    {
        std::ostringstream s;
        s << "MyClass::MyClass(int) in thread " << std::this_thread::get_id() << ", m_n = " << m_n << std::endl;
        std::cout << s.str();
    }
    MyClass(MyClass & other): m_n(other.m_n)
    {
        std::ostringstream s;
        s << "MyClass::MyClass(MyClass &) in thread " << std::this_thread::get_id() << ", m_n = " << m_n << std::endl;
        std::cout << s.str();
    }
    long long m_n;
};

/*=======================================*/

template<typename T>
class TemplateContainer
{
public:
    typedef T obj_type;
    static void print();
    static THREAD obj_type s_obj;
};

template<typename T>
THREAD typename TemplateContainer<T>::obj_type TemplateContainer<T>::s_obj(1);

template <typename T>
void TemplateContainer<T>::print()
{
    std::ostringstream s;
    s << "TemplateContainer<>::print() in thread " << std::this_thread::get_id() << ", MyClass::m_n = " << s_obj.m_n << std::endl;
    std::cout << s.str();
}

/*=======================================*/

class NonTemplateContainer
{
public:
    typedef MyClass obj_type;
    static void print();
    static THREAD obj_type s_obj;
};

THREAD typename NonTemplateContainer::obj_type NonTemplateContainer::s_obj(2);

void NonTemplateContainer::print()
{
    std::ostringstream s;
    s << "NonTemplateContainer::print() in thread " << std::this_thread::get_id() << ", MyClass::m_n = " << s_obj.m_n << std::endl;
    std::cout << s.str();
}

/*=======================================*/

int main()
{
    omp_set_dynamic(0);
    omp_set_num_threads(NTHREADS);

    std::cout << "======================================================================" << std::endl;
    #pragma omp parallel
    NonTemplateContainer::print();

    std::cout << "======================================================================" << std::endl;
    #pragma omp parallel
    TemplateContainer<MyClass>::print();

    std::cout << "======================================================================" << std::endl;
}
--------------------------------------------------------------------------------------------------------------------------------------
When compiled in Intel Parallel Studio XE 2016 (i.e. MSVS 2015 + Intel Compiler 16.0 Update 3) in Debug configuration with all default options plus /Qopenmp and /Qstd=c++11, this code produces the following typical output:
MyClass::MyClass(int) in thread 13724, m_n = 2
MyClass::MyClass(int) in thread 13724, m_n = 1
======================================================================
MyClass::MyClass(int) in thread 14276, m_n = 2
NonTemplateContainer::print() in thread 13724, MyClass::m_n = 2
MyClass::MyClass(int) in thread 11568, m_n = 2
NonTemplateContainer::print() in thread 11568, MyClass::m_n = 2
======================================================================
TemplateContainer<>::print() in thread 13724, MyClass::m_n = 1
TemplateContainer<>::print() in thread 11568, MyClass::m_n = 2073053888752
======================================================================
Here we see that the static TLS member s_obj was first initialized in the master thread for both the non-template and the template class, and then in two spawned worker threads for the non-template container. For the template container, no static initialization took place in the worker threads. Since one of the TemplateContainer<>::print() calls in the parallel region seems to be executed in the master thread (why?), it prints the initialized value, but the other one, executed in a worker thread, prints uninitialized garbage. And that is the lucky case: most of the time it simply crashes with an access violation.
The same code on Linux (CentOS 6.8), compiled with icpc 16.0.3 using -qopenmp -g -std=c++11, produces:
======================================================================
MyClass::MyClass(int) in thread 140110818879296, m_n = 2
MyClass::MyClass(int) in thread 140110660962176, m_n = 2
MyClass::MyClass(int) in thread 140110818879296, m_n = 1
MyClass::MyClass(int) in thread 140110660962176, m_n = 1
NonTemplateContainer::print() in thread 140110818879296, MyClass::m_n = 2
NonTemplateContainer::print() in thread 140110660962176, MyClass::m_n = 2
======================================================================
TemplateContainer<>::print() in thread 140110660962176, MyClass::m_n = 1
TemplateContainer<>::print() in thread 140110818879296, MyClass::m_n = 1
======================================================================
Here only two threads are used, and TemplateContainer<MyClass>::s_obj appears to be initialized correctly in both of them.
Can anyone explain whether I'm doing something wrong, or whether this is a bug?
P.S.: I'm running the tests with OMP_NUM_THREADS set to 1 and then changing the number of threads to 2 in code (see main() above), but it looks like the way the number of threads is set is irrelevant.