[Oisf-devel] libhtp - Normalization of query string

Thu Jun 27 08:35:55 UTC 2013

On 21/06/2013 18:44, Anoop Saldanha wrote:
> On Fri, Jun 21, 2013 at 10:34 PM, Ivan Ristić <ivan.ristic at gmail.com> wrote:
>> On 21/06/2013 17:36, Anoop Saldanha wrote:
>>> On Fri, Jun 21, 2013 at 6:14 PM, Ivan Ristić <ivan.ristic at gmail.com> wrote:
>>>> On 19/06/2013 11:35, Anoop Saldanha wrote:
>>>>> On Wed, Jun 19, 2013 at 3:45 PM, Ivan Ristic <ivan.ristic at gmail.com> wrote:
>>>>>> On Tue, Jun 18, 2013 at 2:12 PM, Anoop Saldanha <anoopsaldanha at gmail.com> wrote:
>>>>>>> On Mon, Jun 17, 2013 at 6:40 PM, Ivan Ristic <ivan.ristic at gmail.com> wrote:
>>>>>>>> On Mon, Jun 17, 2013 at 9:18 AM, Anoop Saldanha <anoopsaldanha at gmail.com> wrote:
>>>>>>>>> While producing the normalized uri, what is the right way to
>>>>>>>>> generate the normalized query string? I can see 2 solutions -
>>>>>>>>>
>>>>>>>>>     1. Duplicate this code section from htp_unparse_uri_noencode( ) -
>>>>>>>>>
>>>>>>>>>         if (uri->query != NULL) {
>>>>>>>>>             bstr *query = bstr_dup(uri->query);
>>>>>>>>>             htp_uriencoding_normalize_inplace(query);
>>>>>>>>>             bstr_add_c_noex(r, "?");
>>>>>>>>>             bstr_add_noex(r, query);
>>>>>>>>>             bstr_free(query);
>>>>>>>>>         }
>>>>>>>>
>>>>>>>> I think this one is a better approach, although it may depend on
>>>>>>>> exactly how you define normalization.
>>>>>>>
>>>>>>> With htp_uriencoding_normalize_inplace(), if it sees a %2d it
>>>>>>> translates it to '-' (hyphen) using x2c, then checks whether it's a
>>>>>>> reserved character and, having confirmed that, leaves it undecoded.
>>>>>>> Is this the right behaviour?
>>>>>>
>>>>>> It depends. It's ambiguous in the spec, and some argue one way, some
>>>>>> another. Unfortunately, I didn't document my reasoning and so I will
>>>>>> need to go back and double-check.
>>>>>>
>>>>>
>>>>> okay.
>>>>>
>>>>> As an example, I have uris with query strings where %2d is not decoded
>>>>> if I use htp_uriencoding_normalize_inplace().  We are also using this
>>>>> function to decode username, password, fragment and hostname, so will
>>>>> have to check if we face the same issue with these.
>>>>>
>>>>>>
>>>>>>> I would have preferred to use htp_decode_urlencoded_inplace(), but
>>>>>>> it's private and duplication would be a nuisance with all the
>>>>>>> references to cfg.
>>>>>>
>>>>>> I don't think you can avoid the reference to cfg, because there are
>>>>>> many settings that control exactly how the decoding is done.
>>>>>
>>>>> Right, which should also count as the reason why we can't use
>>>>> htp_uriencoding_normalize_inplace() for query decoding.
>>>>>
>>>>>> There
>>>>>> isn't any one true way. I could create a public function removing the
>>>>>> reference to tx -- would you like that?
>>>>>
>>>>> Yes, that would be helpful.
>>>>>
>>>>> Before you push the commit for this, can I have a look at it to make
>>>>> sure that's what I want?
>>>>
>>>> How about:
>>>>
>>>>     htp_urldecode_inplace_ex(
>>>>         htp_decoder_cfg_t *cfg,
>>>>         bstr *input,
>>>>         uint64_t flags)?
>>>>
>>>
>>> This should be okay.  Are the flags there to specify whether it's for a path or not?
>>
>> No, to tell you what was contained in the string. What type of encoding,
>> and so on. Same as other flags.
>>
> 
> So you meant a "uint64_t *flags".  Sounds good.

I've added a public function:

    htp_status_t htp_urldecode_inplace(htp_cfg_t *cfg,
        enum htp_decoder_ctx_t ctx, bstr *input, uint64_t *flags);

to 0.5.x. Please give it a go.

Another change is that now everything is decoded, even the reserved
characters. It's not clear if this is the right thing to do, and I
suspect that there isn't any one right thing to do. So see if it works
for you.
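
To make the two behaviours concrete, here is a small standalone sketch in
plain C. It is not libhtp code and does not use cfg; the names hex_val,
is_unreserved and pct_decode_inplace are made up for illustration only.
With decode_all set, every valid %HH escape is decoded, reserved characters
included. Without it, only escapes that map to RFC 3986 unreserved
characters are decoded and the rest are left as they are. Note that RFC
3986 lists the hyphen as unreserved, so this sketch decodes %2d under
either policy.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: numeric value of a hex digit, or -1 if not hex. */
static int hex_val(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~" */
static int is_unreserved(unsigned char c) {
    return isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~';
}

/*
 * Percent-decode the NUL-terminated string s in place.
 * decode_all != 0: decode every valid %HH escape.
 * decode_all == 0: decode only escapes for unreserved characters.
 * Invalid escapes (e.g. "%zz", trailing "%4") are copied unchanged.
 * Returns the new length; with decode_all, "%00" yields an embedded NUL,
 * so callers should rely on the returned length rather than strlen().
 */
static size_t pct_decode_inplace(char *s, int decode_all) {
    size_t r = 0, w = 0;
    assert(s != NULL);
    while (s[r] != '\0') {
        int hi, lo;
        /* Short-circuit evaluation keeps us from reading past the NUL. */
        if (s[r] == '%'
                && (hi = hex_val((unsigned char)s[r + 1])) >= 0
                && (lo = hex_val((unsigned char)s[r + 2])) >= 0) {
            unsigned char c = (unsigned char)(hi * 16 + lo);
            if (decode_all || is_unreserved(c)) {
                s[w++] = (char)c;
                r += 3;
            } else {
                /* Keep the three-byte escape as it was written. */
                s[w++] = s[r++];
                s[w++] = s[r++];
                s[w++] = s[r++];
            }
            continue;
        }
        s[w++] = s[r++];
    }
    s[w] = '\0';
    return w;
}
```

For the input "a%2db%26c%3Dd", the conservative mode gives "a-b%26c%3Dd"
and the decode-everything mode gives "a-b&c=d". The second result can no
longer be split back into parameters unambiguously, which is one reason
there may be no single right answer here.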

I will focus on the normalization process in the next release.

-- 
Ivan