Hi,
it appears that on Windows, static thread-local storage (TLS) members of class templates are not initialized properly in worker threads.
Here is a (maybe not so minimal) test that reproduces the problem:
--------------------------------------------------------------------------------------------------------------------------------------------------
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <omp.h>
#define NTHREADS 2
#define THREAD thread_local
class MyClass
{
public:
    MyClass(int n): m_n(n)
    {
        std::ostringstream s;
        s << "MyClass::MyClass(int) in thread " << std::this_thread::get_id() << ", m_n = " << m_n << std::endl;
        std::cout << s.str();
    }
    MyClass(MyClass & other): m_n(other.m_n)
    {
        std::ostringstream s;
        s << "MyClass::MyClass(MyClass &) in thread " << std::this_thread::get_id() << ", m_n = " << m_n << std::endl;
        std::cout << s.str();
    }
    long long m_n;
};
/*=======================================*/
template<typename T>
class TemplateContainer
{
public:
    typedef T obj_type;
    static void print();
    static THREAD obj_type s_obj;
};
template<typename T>
THREAD typename TemplateContainer<T>::obj_type TemplateContainer<T>::s_obj(1);
template <typename T>
void TemplateContainer<T>::print()
{
    std::ostringstream s;
    s << "TemplateContainer<>::print() in thread " << std::this_thread::get_id() << ", MyClass::m_n = " << s_obj.m_n << std::endl;
    std::cout << s.str();
}
/*=======================================*/
class NonTemplateContainer
{
public:
    typedef MyClass obj_type;
    static void print();
    static THREAD obj_type s_obj;
};
THREAD typename NonTemplateContainer::obj_type NonTemplateContainer::s_obj(2);
void NonTemplateContainer::print()
{
    std::ostringstream s;
    s << "NonTemplateContainer::print() in thread " << std::this_thread::get_id() << ", MyClass::m_n = " << s_obj.m_n << std::endl;
    std::cout << s.str();
}
/*=======================================*/
int main()
{
    omp_set_dynamic(0);
    omp_set_num_threads(NTHREADS);
    std::cout << "======================================================================" << std::endl;
    #pragma omp parallel
    NonTemplateContainer::print();
    std::cout << "======================================================================" << std::endl;
    #pragma omp parallel
    TemplateContainer<MyClass>::print();
    std::cout << "======================================================================" << std::endl;
}
--------------------------------------------------------------------------------------------------------------------------------------
When compiled in Intel Parallel Studio XE 2016 (i.e. MSVS 2015 + Intel Compiler 16.0 Update 3) in Debug configuration with all default options plus /Qopenmp and /Qstd=c++11, this code produces the following typical output:
MyClass::MyClass(int) in thread 13724, m_n = 2
MyClass::MyClass(int) in thread 13724, m_n = 1
======================================================================
MyClass::MyClass(int) in thread 14276, m_n = 2
NonTemplateContainer::print() in thread 13724, MyClass::m_n = 2
MyClass::MyClass(int) in thread 11568, m_n = 2
NonTemplateContainer::print() in thread 11568, MyClass::m_n = 2
======================================================================
TemplateContainer<>::print() in thread 13724, MyClass::m_n = 1
TemplateContainer<>::print() in thread 11568, MyClass::m_n = 2073053888752
======================================================================
Here we see that the static TLS member s_obj was first initialized in the master thread for both the non-template and the template class, and then in the two spawned worker threads for the non-template container only. For the template container there was no static initialization in the worker threads. Since one of the TemplateContainer<>::print() calls in the parallel region seems to be executed in the master thread (why?), it outputs the initialized value, but the other one, executed in a worker thread, outputs uninitialized garbage. And that is the lucky case: most of the time it just crashes with an access violation.
The same code on Linux (CentOS 6.8) with icpc 16.0.3 compiled with -qopenmp -g -std=c++11 produces:
======================================================================
MyClass::MyClass(int) in thread 140110818879296, m_n = 2
MyClass::MyClass(int) in thread 140110660962176, m_n = 2
MyClass::MyClass(int) in thread 140110818879296, m_n = 1
MyClass::MyClass(int) in thread 140110660962176, m_n = 1
NonTemplateContainer::print() in thread 140110818879296, MyClass::m_n = 2
NonTemplateContainer::print() in thread 140110660962176, MyClass::m_n = 2
======================================================================
TemplateContainer<>::print() in thread 140110660962176, MyClass::m_n = 1
TemplateContainer<>::print() in thread 140110818879296, MyClass::m_n = 1
======================================================================
Here only two threads are used, and in both of them TemplateContainer<MyClass>::s_obj seems to be initialized correctly.
Can anyone explain whether I'm doing something wrong or if this seems to be a bug?
P.S.: I'm running the tests with OMP_NUM_THREADS set to 1 and the number of threads then changed to 2 in code (see main() above), but it looks like the way the number of threads is set is irrelevant.