I am proving by construction that there is some basis in which a nilpotent endomorphism has a Jordan canonical form with only ones on the superdiagonal. I'll write out what I have so far and stop where my problem is, so that you can think about it the way I am.
What I want to prove is:
Theorem
Let V(\mathbb{C}) be a finite-dimensional vector space and T\in\mathcal{L}(V) an r-nilpotent endomorphism. There is some basis of V in which the matrix representation of T is a block diagonal matrix, and the blocks have the form
\begin{align*} \left( \begin{array}{cccccc} 0 &1 &0 &0 &\dots &0\\ 0 &0 &1 &0 &\dots &0\\ 0 &0 &0 &1 &\dots &0\\ \vdots &\vdots &\vdots &\vdots &\ddots &\vdots\\ 0 &0 &0 &0 &\dots &1\\ 0 &0 &0 &0 &\dots &0 \end{array} \right) \end{align*}
that is, blocks whose entries are all zero except for the superdiagonal, which is filled with ones.
Proof
First, since T is an r-nilpotent endomorphism we have T^{r}=0_{\mathcal{L}(V)}. Let U_{k}=T^{k}(V). Then U_{1}=T(V)\subseteq V=\operatorname{id}(V)=T^{0}(V)=U_{0}, hence U_{2}=T^{2}(V)=T(T(V))\subseteq T(V)=U_{1}, and if we suppose that U_{k}=T^{k}(V)\subseteq T^{k-1}(V)=U_{k-1}, we conclude that U_{k+1}=T^{k+1}(V)=T(T^{k}(V))\subseteq T(T^{k-1}(V))=T^{k}(V)=U_{k}. So we have proven by induction on k that U_{k}\subseteq U_{k-1}. Since T^{r}=0_{\mathcal{L}(V)} and U_{k}=T(U_{k-1}), we obtain \{0_{V}\}=U_{r}\subseteq U_{r-1}\subseteq\dots\subseteq U_{1}\subseteq U_{0}=V, and we have also shown that the U_{k} are T-invariant subspaces and that U_{r-1}\subseteq\ker T.
In the same manner, let W_{0}=\ker T^{0}=\ker\operatorname{id}=\{0_{V}\} and W_{k}=\ker T^{k}. It is easy to see that T(W_{0})=T(\{0_{V}\})=\{0_{V}\}, therefore W_{0}\subseteq W_{1}; moreover, T^{2}(W_{1})=T(T(W_{1}))=T(\{0_{V}\})=\{0_{V}\}, therefore W_{1}\subseteq W_{2}. Now suppose W_{k-1}\subseteq W_{k}; then T^{k+1}(W_{k})=T(T^{k}(W_{k}))=T(\{0_{V}\})=\{0_{V}\}, therefore W_{k}\subseteq W_{k+1}, and we conclude that we have the chain of nested subspaces \{0_{V}\}=W_{0}\subseteq W_{1}\subseteq\dots\subseteq W_{r-1}\subseteq W_{r}=V, since W_{r}=\ker T^{r}=\ker 0_{\mathcal{L}(V)}=V.
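To convince myself of the two chains, I wrote a minimal numerical sketch (assuming Python with numpy; the 5×5 nilpotent matrix is a hypothetical example, deliberately already in Jordan form with blocks of sizes 3 and 2, so r=3 and the subspaces are easy to read off):

```python
import numpy as np

# Hypothetical example: nilpotent T on C^5 with Jordan blocks of sizes 3 and 2,
# so the degree of nilpotence is r = 3.
T = np.zeros((5, 5))
T[0, 1] = T[1, 2] = T[3, 4] = 1.0

n = T.shape[0]
for k in range(4):
    Tk = np.linalg.matrix_power(T, k)
    dim_U = np.linalg.matrix_rank(Tk)   # dim U_k = dim T^k(V)
    dim_W = n - dim_U                   # dim W_k = dim ker T^k, by rank-nullity
    print(f"k={k}: dim U_k = {dim_U}, dim W_k = {dim_W}")
# Prints dims 5, 3, 1, 0 for the U_k and 0, 2, 4, 5 for the W_k,
# exhibiting the two nested chains.
```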
Since we have a chain of nested subspaces in which the largest is V itself, if we choose a basis for the smallest non-trivial one (Supposing U_{r}\neq U_{r-1}), that is U_{r-1}, we can climb the chain, constructing a basis for each larger space by completing the basis we already have, which is always possible.
Now, since U_{r-1}\subseteq\ker T, every vector in U_{r-1} is an eigenvector for the eigenvalue 0, so every basis we choose for U_{r-1} is a basis of eigenvectors. To complete this basis \{u_{i}^{(r-1)}\} to a basis of U_{r-2} (Supposing U_{r-1}\neq U_{r-2}), recall that T(U_{r-2})=U_{r-1}; therefore every vector in U_{r-1} has a preimage in U_{r-2}. Then there are some u_{i}^{(r-2)}\in U_{r-2} (maybe many for each i since we don't know T is injective) such that T(u_{i}^{(r-2)})=u_{i}^{(r-1)}. Note that for fixed i it is not possible that u_{i}^{(r-2)}=u_{i}^{(r-1)}, since u_{i}^{(r-1)} is an eigenvector associated to the eigenvalue 0, as is every vector in U_{r-1}, being a linear combination of the basis vectors. Since the preimages need not be unique, we choose one and only one for each i. It only remains to see that they are linearly independent: take a null linear combination \sum_{i}\alpha_{i}u_{i}^{(r-1)}+\sum_{i}\beta_{i}u_{i}^{(r-2)}=0_{V} and apply T on both sides: \sum_{i}\alpha_{i}T(u_{i}^{(r-1)})+\sum_{i}\beta_{i}T(u_{i}^{(r-2)})=\sum_{i}\alpha_{i}0_{V}+\sum_{i}\beta_{i}u_{i}^{(r-1)}=\sum_{i}\beta_{i}u_{i}^{(r-1)}=0_{V}. Since the last sum is a null linear combination of linearly independent vectors (they form a basis for U_{r-1}), it implies that \beta_{i}=0 for every i. Therefore the initial expression takes the form \sum_{i}\alpha_{i}u_{i}^{(r-1)}=0_{V}, and \alpha_{i}=0 for every i by the same argument. We conclude that they are linearly independent.
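On the same hypothetical 5×5 example, a preimage can be produced explicitly: there U_{r-1}=U_{2} is spanned by a single vector u, and solving T^{2}w=u and then setting v=Tw guarantees that the preimage v actually lies in U_{1}=T(V) (a sketch, assuming numpy):

```python
import numpy as np

T = np.zeros((5, 5))
T[0, 1] = T[1, 2] = T[3, 4] = 1.0
I = np.eye(5)

u = I[:, 0]                        # basis vector of U_2 (first standard vector)
# Solve T^2 w = u (consistent, since u lies in T^2(V)); then v = T w is in U_1.
w, *_ = np.linalg.lstsq(np.linalg.matrix_power(T, 2), u, rcond=None)
v = T @ w
print(np.allclose(T @ v, u))       # True: v is a preimage of u inside U_1
print(v)                           # here v is the second standard vector
```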
At this moment we have \{u_{i}^{(r-1)},u_{i}^{(r-2)}\}, a linearly independent set of vectors in U_{r-2}. If \dim U_{r-2}=2\dim U_{r-1}, then the construction is finished; if not (\dim U_{r-2}\geq 2\dim U_{r-1}+1), then we have to choose vectors u_{j}^{(r-2)} with j=\dim U_{r-1}+1,\dots,\dim U_{r-2} that complete the set to a basis of U_{r-2}. Again, as in the construction of the u_{i}^{(r-2)}, recall that T(U_{r-2})=U_{r-1}. Therefore, every vector v_{j}^{(r-2)} we choose will satisfy T(v_{j}^{(r-2)})=\sum_{i}\mu_{ji}u_{i}^{(r-1)}. But since we want them to be linearly independent from the u_{i}^{(r-1)} and u_{i}^{(r-2)}, we can choose them from \ker T; that is, we can set u_{j}^{(r-2)}=v_{j}^{(r-2)}-\sum_{i}\mu_{ji}u_{i}^{(r-2)}, and applying T we obtain T(u_{j}^{(r-2)})=T(v_{j}^{(r-2)})-\sum_{i}\mu_{ji}T(u_{i}^{(r-2)})=\sum_{i}\mu_{ji}u_{i}^{(r-1)}-\sum_{i}\mu_{ji}u_{i}^{(r-1)}=0_{V}. Then we only need to see that they are linearly independent from the others. Take, again, a null linear combination \sum_{i}\alpha_{i}u_{i}^{(r-1)}+\sum_{i}\beta_{i}u_{i}^{(r-2)}+\sum_{j}\gamma_{j}u_{j}^{(r-2)}=0_{V}. First apply T to both sides: \sum_{i}\alpha_{i}T(u_{i}^{(r-1)})+\sum_{i}\beta_{i}T(u_{i}^{(r-2)})+\sum_{j}\gamma_{j}T(u_{j}^{(r-2)})=\sum_{i}\alpha_{i}0_{V}+\sum_{i}\beta_{i}u_{i}^{(r-1)}+\sum_{j}\gamma_{j}0_{V}=\sum_{i}\beta_{i}u_{i}^{(r-1)}=0_{V}, and therefore \beta_{i}=0 for every i, since \{u_{i}^{(r-1)}\} is a basis. Then the initial expression takes the form \sum_{i}\alpha_{i}u_{i}^{(r-1)}+\sum_{j}\gamma_{j}u_{j}^{(r-2)}=0_{V}. Note that we have two sets of vectors that are in \ker T...
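The correction step can be checked on the same example: since \dim U_{r-1}=1 there, Tx is a multiple \mu u of the single basis vector, and subtracting \mu v puts the corrected vector in \ker T (again a hypothetical sketch with numpy):

```python
import numpy as np

T = np.zeros((5, 5))
T[0, 1] = T[1, 2] = T[3, 4] = 1.0
I = np.eye(5)
u, v = I[:, 0], I[:, 1]            # from the previous sketch

x = I[:, 1] + I[:, 3]              # a vector of U_1 outside span{u, v}
mu = (T @ x) @ u                   # T x = mu * u, so mu = <T x, u> (u is a unit vector)
z = x - mu * v                     # corrected vector
print(np.allclose(T @ z, 0))       # True: z lies in ker T
print(z)                           # z is the fourth standard vector, independent of u, v
```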
This is the point where I don't see a way to show that \alpha_{i}=0 and \gamma_{j}=0 for every i and j, in order to conclude that they are linearly independent. Any kind of help (hints above all) will be appreciated.
Mostly, you are on the right track and everything you say is correct, though there are a few spots where a bit more thought would let you be sharper. Let me discuss those first.
You note along the way that "(Supposing U_r\neq U_{r-1})". In fact, we know that for each i, 0\leq i\lt r, U_{i+1}\neq U_i. The reason is that if we have U_{i+1}=U_i, then that means that U_{i+2}=T(U_{i+1}) = T(U_i) = U_{i+1}, and so we have reached a stabilizing point; since we know that the sequence must end with the trivial subspace, that would necessarily imply that U_i=\{\mathbf{0}\}. But we are assuming that the degree of nilpotence of T is r, so that U_i\neq\{\mathbf{0}\} for any i\lt r; hence U_{i+1}\neq U_i is a certainty, not an assumption.
You also comment parenthetically: "(maybe many for each i since we don't know T is injective)". Actually, we know that T is definitely not injective, because T is nilpotent. The only way T could be both nilpotent and injective is if \mathbf{V} is zero dimensional. And since every vector of U_{r-1} is mapped to 0, it is certainly the case that the restriction of T to U_i is not injective for any i, 0\leq i\lt r.
As to what you are doing: suppose u_1,\ldots,u_t is a basis for U_{r-1}, and v_1,\ldots,v_t are vectors in U_{r-2} such that T(v_i) = u_i. We want to show that \{u_1,\ldots,u_t,v_1,\ldots,v_t\} is linearly independent; you can do that the way you did before: take a linear combination equal to \mathbf{0},
\alpha_1u_1+\cdots+\alpha_tu_t + \beta_1v_1+\cdots+\beta_t v_t = \mathbf{0}.
Apply T to get \beta_1u_1+\cdots + \beta_tu_t=\mathbf{0} and conclude the \beta_j are zero; and then use the fact that u_1,\ldots,u_t is linearly independent to conclude that \alpha_1=\cdots=\alpha_t=0.
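If you want to see this concretely: with the hypothetical 5×5 example from your sketches (where t=1, u is the first standard vector and v the second), the check is a one-line rank computation (assuming numpy):

```python
import numpy as np

I = np.eye(5)
u, v = I[:, 0], I[:, 1]            # from the earlier sketches
print(np.linalg.matrix_rank(np.column_stack([u, v])))   # 2: {u, v} is independent
```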
Now, this may not be a basis for U_{r-2}, since there may be elements of \mathrm{ker}(T)\cap U_{r-2} that are not in U_{r-1}.
The key is to choose what is missing so that they are linearly independent from u_1,\ldots,u_t. How can we do that? Note that U_{r-1}\subseteq \mathrm{ker}(T), so in fact U_{r-1}\subseteq \mathrm{ker}(T)\cap U_{r-2}.
So we can complete \{u_1,\ldots,u_t\} to a basis for \mathrm{ker}(T)\cap U_{r-2} with some vectors z_1,\ldots,z_s.
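If you want to compute this intersection concretely, here is a minimal sketch with scipy (on the hypothetical 5×5 example from your question, where the intersection turns out to be 2-dimensional, so beyond u_1 a single new vector z_1 is needed):

```python
import numpy as np
from scipy.linalg import orth, null_space

T = np.zeros((5, 5))
T[0, 1] = T[1, 2] = T[3, 4] = 1.0

R = orth(T)                        # orthonormal basis of U_1 = T(V)
K = null_space(T)                  # orthonormal basis of ker T
# x lies in both subspaces iff x = R a = K b, i.e. [R | -K](a; b) = 0.
N = null_space(np.hstack([R, -K]))
inter = R @ N[:R.shape[1], :]      # basis of ker(T) ∩ U_1
print(np.linalg.matrix_rank(inter))   # 2 = t + s, with t = 1 and s = 1
```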
The question now is how to show that \{u_1,\ldots,u_t,v_1,\ldots,v_t,z_1,\ldots,z_s\} is linearly independent. The answer is: the same way. Take a linear combination equal to 0:
\alpha_1u_1+\cdots +\alpha_tu_t + \beta_1v_1+\cdots +\beta_tv_t + \gamma_1z_1+\cdots+\gamma_s z_s = \mathbf{0}.
Apply T to conclude that the \beta_i are zero; then use the fact that \{u_1,\ldots,u_t,z_1,\ldots,z_s\} is a basis for \mathrm{ker}(T)\cap U_{r-2} to conclude that the \alpha_i and the \gamma_j are all zero as well.
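On that running example the resulting set is indeed a basis (with u, v, z the first, second, and fourth standard vectors, as computed in the earlier sketches; assuming numpy):

```python
import numpy as np

I = np.eye(5)
u, v, z = I[:, 0], I[:, 1], I[:, 3]   # from the earlier sketches
B = np.column_stack([u, v, z])
print(np.linalg.matrix_rank(B))       # 3 = dim U_1: {u, v, z} is a basis of U_1
```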
And now you have a basis for U_{r-2}. Why? Because by the Rank-Nullity Theorem applied to the restriction of T to U_{r-2}, we know that
\dim(U_{r-2}) = \dim(T(U_{r-2})) + \dim(\mathrm{ker}(T)\cap U_{r-2}).
But T(U_{r-2}) = U_{r-1}, so \dim(T(U_{r-2})) = \dim(U_{r-1}) = t; and \dim(\mathrm{ker}(T)\cap U_{r-2}) = t+s, since \{u_1,\ldots,u_t,z_1,\ldots,z_s\} is a basis for this subspace. Hence \dim(U_{r-2}) = t+t+s=2t+s, which is exactly the number of linearly independent vectors you have.
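This dimension count is easy to verify numerically on the same hypothetical example (a sketch assuming numpy and scipy):

```python
import numpy as np
from scipy.linalg import orth

T = np.zeros((5, 5))
T[0, 1] = T[1, 2] = T[3, 4] = 1.0

U1 = orth(T)                          # basis of U_1, dimension 3
t = np.linalg.matrix_rank(T @ U1)     # dim T(U_1) = dim U_2 = 1
print(U1.shape[1], t, U1.shape[1] - t)   # 3 = 1 + 2: dim U_1 = t + (t + s)
```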
You want to use the same idea "one step up": you will have that u_1,\ldots,u_t,z_1,\ldots,z_s is a linearly independent subset of U_{r-3}\cap\mathrm{ker}(T), so you will complete it to a basis of that intersection; after adding preimages of v_1,\ldots,v_t and of z_1,\ldots,z_s, you will get a "nice" basis for U_{r-3}. And so on.
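Carrying the hypothetical 5×5 example one step up shows the whole procedure at work: add preimages p and q of the chain tops v and z inside V=U_0 (here \mathrm{ker}(T)\cap U_0=\mathrm{ker}(T) is already spanned by u and z, so no new kernel vectors are needed), order the basis chain by chain, and the matrix of T in that basis is block diagonal with the desired blocks. A sketch, assuming numpy:

```python
import numpy as np

T = np.zeros((5, 5))
T[0, 1] = T[1, 2] = T[3, 4] = 1.0
I = np.eye(5)
u, v, z = I[:, 0], I[:, 1], I[:, 3]   # nice basis of U_1 from before

p, *_ = np.linalg.lstsq(T, v, rcond=None)   # T p = v
q, *_ = np.linalg.lstsq(T, z, rcond=None)   # T q = z
P = np.column_stack([u, v, p, z, q])        # chains (u, v, p) and (z, q)
A = np.linalg.solve(P, T @ P)               # matrix of T in the new basis
print(np.round(A).astype(int))              # two Jordan blocks, sizes 3 and 2
```

Since this T was deliberately chosen already in Jordan form, the recovered basis is just the standard one; conjugating T by any invertible matrix first and rerunning the same steps would produce a genuine change of basis with the same block diagonal result.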