(Analysis by Botao Yuan)

We are asked, for $N$ queries, what is the cheapest cost required to remove $a_i$ haybales. To remove haybales, we can hire cows. Cow $i$ has attributes ($s_i$, $p_i$, $c_i$), indicating that if the stack has $h$ haybales, we can pay $c_i$ to remove

$\min(s_i, \max(0, h - p_i + 1))$ haybales.

Let us now consider what an optimal (or close-to-optimal) hiring order looks like. It is clear that we should generally try to hire cows where $s_i / c_i $ is larger (given that $p_i$ > $h$, our current haybale amount). Call the cow that maximizes this quantity the “best-rate cow”. Notice that every time we are at a value $p_i - 1$, the best-rate cow may change. Call all values $p_i - 1$ special values.

The problem now is the constraints on $p_i$; when $h - p_i + 1 < s_i$, a cow removes fewer than $s_i$ haybales while still costing $c_i$. This means the optimal strategy near a special value depends on the exact remainder, so always greedily choosing the best-rate cow until we exhaust all haybales is not optimal.

On the other hand, if we are an extremely long distance away from any special value, we should choose the best-rate cow some number of times, until some threshold where we need to worry about the remainder. It turns out this threshold is $S^2$ away from the current special value we are considering.

Claim: For $d \ge S^2$ (where $S$ is the maximum value of $s_i$), the optimal cost satisfies $f(d) = f(d - s^*) + c^*$, where $(s^*, c^*)$ is the best-rate cow (highest $s_i / c_i$). Here, $f(x)$ indicates the minimum cost to remove $x$ haybales above the current special value, assuming we only consider cows with $p_i - 1$ less than or equal to the current special value.

Proof: Consider the optimal solution for some $d \ge S^2$. If it uses the the best-rate cow first, then we have $f(d) = c^* + f(d - s^*)$, and we are done. Now suppose the best-rate cow is used at least once among the first $s^*$ cows to clear $s^*$ haybales. Then, we can swap the order of cows so that we use the best-rate cow first. Since every cow still clears the same number of haybales, $f(d)$ does not change after this swap. Hence $f(d) = c^* + f(d - s^*)$ from the first case.

Otherwise, suppose the solution does not use the best-rate cow among the first $s^*$ cows. Then, the solution hires some sequence of cows removing $s_{i_1}, s_{i_2}, \ldots, s_{i_t}$ haybales at costs $c_{i_1}, \ldots, c_{i_t}$. Since each $s_{i_j} \le S$ and their sum is at least $d \ge S^2 \ge S \cdot s^*$, we have $t \ge s^*$.

Consider just the first $s^* + 1$ partial sums $0, s_{i_1}, s_{i_1} + s_{i_2}, \ldots$ modulo $s^*$. By pigeonhole, two partial sums are congruent mod $s^*$, so some contiguous subsequence within the first $s^*$ cows removes exactly $k \cdot s^*$ haybales for some integer $k \ge 1$, at total cost $C_{\text{sub}}$.

Since $(s^*, c^*)$ has the best rate, every cow $i$ satisfies $s_i / c_i \le s^* / c^*$, i.e., $c_i \ge s_i \cdot c^* / s^*$. Summing over the subsequence: $C_{\text{sub}} \ge (k \cdot s^*) \cdot c^* / s^* = k \cdot c^*$.

So replacing this subsequence with $k$ copies of the best-rate cow removes the same number of haybales at cost $k \cdot c^* \le C_{\text{sub}}$. After this replacement, the best-rate cow now appears among the first $s^*$ hirings, at no greater total cost. This reduces to the second case, so $f(d) = c^* + f(d - s^*)$.

Therefore, for all $d \ge S^2$, we have $f(d) = c^* + f(d - s^*)$, and only $O(S^2)$ DP states need to be explicitly computed.

You may notice that this is also analogous to the Frobenius/Coin problem.

Partial Solution: O(MS^3)

We will process cows in increasing $p_i$ and process queries in increasing $a_i$. We will also keep track of the most efficient cow (i.e., the cow with the best $s_i / c_i$), for every $s_i$. When we encounter a new $p_i$, we will compute a DP table of size $S^2$, where $DP_j$ indicates the answer for if we have $p_i - 1 + j$ haybales.

To recompute the DP table, firstly, we can simply query the previous DP table for our initial values (i.e. for each $j$, treat $DP_j$ as a query to the previous DP table, not taking into account the new cow). Then, we can do a knapsack on each of the $S$ cows which have the best rate for their $s_i$. This transition is

$$DP_j = \min(DP_j, \; c + DP_{\max(0, \, j - s_i)}), \quad j = 0, 1, \ldots, S^2.$$

There are $S^2$ states, so across $S$ sweeps, the complexity is $O(S^3)$.

After processing $p_i$, answer all queries $a_k$ such that $a_k < p_{i+1}$. To answer a query, if $a_k < p_i - 1 + S^2$, simply refer to the DP table: the answer is $DP_{a_k - (p_i - 1)}$. Otherwise, use the best-rate cow $(s^*, c^*)$ to reduce $a_k$ into the table's range: compute $q = \lceil (a_k - (p_i - 1) - S^2 + 1) / s^* \rceil$, and the answer is $DP_{a_k - (p_i - 1) - q \cdot s^*} + q \cdot c^*$.

The total runtime is $O(MS^3 + N\log N + M\log M)$.

Full Solution: O(MS^2)

We can speed up the DP recomputation by observing that when we process a new cow with threshold $p_i$, we do not need to redo the knapsack for all $S$ distinct work values. The previous DP table already encodes the optimal combinations of all previously seen cows. When we initialize the new table by querying the old one, those earlier cows' contributions are already accounted for.

The only new information is the cow we just introduced. So after initializing the $S^2$ states from the previous table, we perform a single knapsack sweep using only the new cow's work $w$ and cost $c$. Hence, this step is now $O(S^2)$.

This single sweep costs $O(S^2)$ per cow, giving a total runtime of $O(MS^2 + N\log N + M\log M)$.

Sujay’s implementation:

 
#include <bits/stdc++.h>
using namespace std;
using ll = long long;
const ll INF = 1e18;
 
const int S = 100;
 
void upd(ll& a, ll b) { a = min(a, b); }
 
int main() {
    ios::sync_with_stdio(false), cin.tie(nullptr);
    int t; cin >> t;
    while(t--) {
        ll n; cin >> n;
        vector<ll> a(n), oa(n);
        for(int i = 0; i < n; i++) {
            cin >> a[i];
            oa[i] = i;
        }
        sort(oa.begin(), oa.end(), [&] (ll x, ll y) { return a[x] < a[y]; });
        ll m; cin >> m;
        vector<ll> l(m), s(m), v(m), o(m);
        for(int i = 0; i < m; i++) {
            cin >> l[i] >> s[i] >> v[i];
            l[i]--;
            o[i] = i;
        }
        sort(o.begin(), o.end(), [&] (ll x, ll y) { return l[x] < l[y]; });
        int pl = 0;
        vector<ll> dp(S * S, INF); dp[0] = 0;
        vector<ll> best(S + 1, 1e15);
 
        // computes expdp which is used to query
        vector<ll> expdp;
        int bsi = 1;
        auto compute = [&] (int nsi) {
            for(int si = 1; si <= S; si++) {
                if(best[bsi] * si > bsi * best[si]) {
                    bsi = si;
                }
            }
            expdp = dp;
            for(int i = 0; i < expdp.size(); i++) {
                if(i + nsi < expdp.size()) {
                    upd(expdp[i + nsi], expdp[i] + best[nsi]);
                }
            }
        };
        // uses expdp to query
        auto query = [&] (ll x) {
            ll d = x - pl;
            ll t = max((d - (ll)expdp.size()) / bsi + 1, 0ll);
            return t * best[bsi] + expdp[d - bsi * t];
        };
        int k = 0;
        int lasts = 0;
        vector<ll> ans(n);
        for(int i : o) {
            compute(lasts);
            while(k < n && a[oa[k]] < l[i]) ans[oa[k]] = query(a[oa[k]]), k++;
            vector<ll> ndp(S * S, INF);
            for(int j = 0; j < S * S; j++) {
                ndp[j] = query(l[i] + j);
            }
            for(int j = 1; j < s[i]; j++) upd(ndp[j], ndp[0] + v[i]);
            upd(best[s[i]], v[i]);
            lasts = s[i];
            dp = ndp;
            pl = l[i];
        }
        compute(lasts);
        while(k < n) ans[oa[k]] = query(a[oa[k]]), k++;
        for(ll i : ans) cout << i << " ";
        cout << endl;
    }
}