Processing math: 100%
(Analysis by Benjamin Qi)

For N20, any reasonable polynomial-time solution should work. One possible approach is to calculate the result for all nN,k(n2) in O(N7).

For additional points, we should find a way to compute di(a) without explicitly constructing the tree. The key condition is that j is an ancestor of i if a[j]=min(a[ij]), so it follows that

di(a)=1+1j<i(a[j]==min(a[ji]))+i<jn(a[j]==min(a[ij])).
Let's focus on counting the number of permutations a such that a[j]==min(a[ij]) for some fixed pair (i,j) satisfying i<j. We'll do this by constructing a one element at a time.

First, we start with a sequence consisting of a[i] only. Then a[i+1] can be either greater than a[i] or less than a[i], contributing 0 or 1 inversion. Then a[i+2] can take on any of three different values relative to a[i] and a[i+1], contributing anywhere from 0 to 2 inversions. Continuing in this fashion, the possible numbers of inversions in the sub-permutation a[ij1] can be represented by the polynomial product

jit=1(t1u=0xu).
This is known as a generating function because we are encoding a sequence using a polynomial. If we expand it and group together the terms with the same power of x, then a term in the form cxd means that there are exactly c permutations with d inversions.

Adding a[j] contributes ji inversions regardless of how many inversions a[ij1] has. Then we should add the remaining elements of the permutation, each of which can go anywhere in the sorted order. Thus, the final result is given by the generating function

nt=1(t1u=0xu)1jiu=0xuxji.
The first part of this product does not depend on i or j, and we can calculate it in O(N3) time with prefix sums. We can divide it by jiu=0xu in O(N2) time by reversing the process we used to multiply.

After dividing, all we need is the coefficient of xk(ji). Since the product depends only on ji, we only need to do N different divisions. Alternatively, we can maintain prefix and suffix products without needing to do division. The process for i>j is almost exactly the same, except a[j] contributes 0 inversions rather than ij.

The whole solution runs in O(N3) time and O(N2) memory. My code follows. It turns out that M being prime is irrelevant ...

#include <bits/stdc++.h>
using namespace std;

typedef long long ll;
typedef vector<int> vi; 
 
#define FOR(i,a,b) for (int i = (a); i < (b); ++i)
#define F0R(i,a) FOR(i,0,a)
#define ROF(i,a,b) for (int i = (b)-1; i >= (a); --i)
#define R0F(i,a) ROF(i,0,a)
#define trav(a,x) for (auto& a: x)
 
#define pb push_back
#define rsz resize
#define sz(x) int(x.size())
 
void setIO(string name) {
	ios_base::sync_with_stdio(0); cin.tie(0);
	freopen((name+".in").c_str(),"r",stdin);
	freopen((name+".out").c_str(),"w",stdout);
}

int MOD;
int n,k;

typedef int T;
struct mi {
	T val; 
	mi() { val = 0; }
	mi(const ll& v) { 
		val = (-MOD <= v && v <= MOD) ? v : v % MOD;
		if (val < 0) val += MOD;
	}
	mi& operator+=(const mi& m) { 
		if ((val += m.val) >= MOD) val -= MOD; 
		return *this; }
	mi& operator-=(const mi& m) { 
		if ((val -= m.val) < 0) val += MOD; 
		return *this; }
};
typedef vector<mi> vmi;
 
void ad(vmi& a, int b) { // multiply by (x^0+x^1+...+x^{b-1})
	a.rsz(sz(a)+b-1);
	R0F(i,sz(a)-b) a[i+b] -= a[i];
	FOR(i,1,sz(a)) a[i] += a[i-1];
}
void sub(vmi& a, int b) {
	ROF(i,1,sz(a)) a[i] -= a[i-1];
	F0R(i,sz(a)-b) a[i+b] += a[i];
	a.rsz(sz(a)-b+1); 
}
mi get(vmi& a, int b) {
	if (b < 0 || b >= sz(a)) return 0;
	return a[b];
}
 
int main() {
	setIO("treedepth"); 
	cin >> n >> k >> MOD;
	vmi v = {1}; FOR(i,1,n+1) ad(v,i);
	vmi ans(n,v[k]);
	FOR(dif,1,n) {
		sub(v,dif+1);
		mi x = get(v,k-dif), y = get(v,k);
		ad(v,dif+1);
		F0R(a,n-dif) {
			ans[a] += x;
			ans[a+dif] += y;
		}
	}
	F0R(i,n) cout << ans[i].val << ' ';
}