ACME standalone broken in pfSense 2.4.2?



  • I made a few other changes to my pfSense machine, but upgraded to 2.4.2 was the biggest change. Wondering if it broke my existing ACME setup.

    I run standalone mode, with an HAproxy ACL:

       acl         ACME   path_beg -i /.well-known/acme-challenge/
    
    

    Which forwards HTTP traffic to the backend of the acme.sh script when it's running.

    –-

    When I attempted to create a new certificate for a new domain today, I got a message reading:

    [Tue Dec 5 17:32:14 PST 2017] REDACTED.example.com:Verify error:Fetching [url]http://REDACTED.example.com/.well-known/acme-challenge/GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To[/url]: Timeout
    
    

    I checked the REDACTED.example.com/acme_issuecert.log file and saw:

    [Tue Dec  5 17:32:04 PST 2017] checking
    [Tue Dec  5 17:32:04 PST 2017] GET
    [Tue Dec  5 17:32:04 PST 2017] url='[url]https://acme-staging.api.letsencrypt.org/acme/challenge/mN6TKFUw4D4TW0eU88l455rsOMzO5Y100c1xg-FOOBAR/81224058[/url]'
    [Tue Dec  5 17:32:04 PST 2017] timeout
    [Tue Dec  5 17:32:04 PST 2017] curl exists=0
    [Tue Dec  5 17:32:04 PST 2017] wget exists=127
    [Tue Dec  5 17:32:04 PST 2017] _CURL='curl -L --silent --dump-header /tmp/acme/REDACTED.example.com//http.header '
    [Tue Dec  5 17:32:05 PST 2017] ret='0'
    [Tue Dec  5 17:32:05 PST 2017] original='{
      "type": "http-01",
      "status": "pending",
      "uri": "https://acme-staging.api.letsencrypt.org/acme/challenge/mN6TKFUw4D4TW0eU88l455rsOMzO5Y100c1xg-FOOBAR/81224058",
      "token": "GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To",
      "keyAuthorization": "GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To.SWhpVHEbWXGHa_ckmbUPkkwK6AJ-328vsI-mQuZhiAE"
    }'
    [Tue Dec  5 17:32:05 PST 2017] response='{"type":"http-01","status":"pending","uri":"https://acme-staging.api.letsencrypt.org/acme/challenge/mN6TKFUw4D4TW0eU88l455rsOMzO5Y100c1xg-FOOBAR/81224058","token":"GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To","keyAuthorization":"GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To.SWhpVHEbWXGHa_ckmbUPkkwK6AJ-328vsI-mQuZhiAE"}'
    [Tue Dec  5 17:32:05 PST 2017] Pending
    [Tue Dec  5 17:32:05 PST 2017] sleep 2 secs to verify
    
    

    These lines are repeated ~4 times before finally reporting:

    [Tue Dec  5 17:32:14 PST 2017] checking
    [Tue Dec  5 17:32:14 PST 2017] GET
    [Tue Dec  5 17:32:14 PST 2017] url='[url]https://acme-staging.api.letsencrypt.org/acme/challenge/mN6TKFUw4D4TW0eU88l455rsOMzO5Y100c1xg-FOOBAR/81224058[/url]'
    [Tue Dec  5 17:32:14 PST 2017] timeout
    [Tue Dec  5 17:32:14 PST 2017] curl exists=0
    [Tue Dec  5 17:32:14 PST 2017] wget exists=127
    [Tue Dec  5 17:32:14 PST 2017] _CURL='curl -L --silent --dump-header /tmp/acme/REDACTED.example.com//http.header '
    [Tue Dec  5 17:32:14 PST 2017] ret='0'
    [Tue Dec  5 17:32:14 PST 2017] original='{
      "type": "http-01",
      "status": "invalid",
      "error": {
        "type": "urn:acme:error:connection",
        "detail": "Fetching [url]http://REDACTED.example.com/.well-known/acme-challenge/GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To[/url]: Timeout",
        "status": 400
      },
      "uri": "https://acme-staging.api.letsencrypt.org/acme/challenge/mN6TKFUw4D4TW0eU88l455rsOMzO5Y100c1xg-FOOBAR/81224058",
      "token": "GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To",
      "keyAuthorization": "GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To.SWhpVHEbWXGHa_ckmbUPkkwK6AJ-328vsI-mQuZhiAE",
      "validationRecord": [
        {
          "url": "http://REDACTED.example.com/.well-known/acme-challenge/GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To",
          "hostname": "REDACTED.example.com",
          "port": "80",
          "addressesResolved": [
            "1.2.3.4"
          ],
          "addressUsed": "1.2.3.4",
          "addressesTried": []
        }
      ]
    }'
    [Tue Dec  5 17:32:14 PST 2017] response='{"type":"http-01","status":"invalid","error":{"type":"urn:acme:error:connection","detail":"Fetching [url]http://REDACTED.example.com/.well-known/acme-challenge/GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To[/url]: Timeout","status": 400},"uri":"https://acme-staging.api.letsencrypt.org/acme/challenge/mN6TKFUw4D4TW0eU88l455rsOMzO5Y100c1xg-FOOBAR/81224058","token":"GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To","keyAuthorization":"GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To.SWhpVHEbWXGHa_ckmbUPkkwK6AJ-328vsI-mQuZhiAE","validationRecord":[{"url":"http://REDACTED.example.com/.well-known/acme-challenge/GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To","hostname":"REDACTED.example.com","port":"80","addressesResolved":["73.158.251.232"],"addressUsed":"73.158.251.232","addressesTried":[]}]}'
    [Tue Dec  5 17:32:14 PST 2017] error='"error":{"type":"urn:acme:error:connection","detail":"Fetching [url]http://REDACTED.example.com/.well-known/acme-challenge/GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To[/url]: Timeout","status": 400'
    [Tue Dec  5 17:32:14 PST 2017] errordetail='Fetching [url]http://REDACTED.example.com/.well-known/acme-challenge/GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To[/url]: Timeout'
    [Tue Dec  5 17:32:14 PST 2017] REDACTED.example.com:Verify error:Fetching [url]http://REDACTED.example.com/.well-known/acme-challenge/GVi_q_EG8jQVcwj2w2v2O_bJRja29oqbarIYNOEb1To[/url]: Timeout
    
    

    After this point it seems to make another request to localhost:8082 (where the standalone instance is running)? This fails because the socket is closed at this point. And finally tries to ping LE one more time before quitting:

    [Tue Dec  5 17:32:16 PST 2017] original='{
      "type": "urn:acme:error:malformed",
      "detail": "Unable to update challenge :: The challenge is not pending.",
      "status": 400
    }'
    [Tue Dec  5 17:32:16 PST 2017] responseHeaders='HTTP/1.1 100 Continue
    Expires: Wed, 06 Dec 2017 01:32:16 GMT
    Cache-Control: max-age=0, no-cache, no-store
    Pragma: no-cache
    
    HTTP/1.1 400 Bad Request
    Server: nginx
    Content-Type: application/problem+json
    Content-Length: 132
    Boulder-Requester: 2578726
    Replay-Nonce: 4UZdr2lNsdFgkD_GW5-COZ5bWPKOxK-R7r4TBOvuEAs
    Expires: Wed, 06 Dec 2017 01:32:16 GMT
    Cache-Control: max-age=0, no-cache, no-store
    Pragma: no-cache
    Date: Wed, 06 Dec 2017 01:32:16 GMT
    Connection: close
    ^M'
    [Tue Dec  5 17:32:16 PST 2017] response='{"type":"urn:acme:error:malformed","detail":"Unable to update challenge :: The challenge is not pending.","status": 400}'
    [Tue Dec  5 17:32:16 PST 2017] code='400'
    
    

    (FOOBAR has replaced a portion of the LE token for redaction purposes.)

    –-

    I verified with a packet capture that LE is successfully reaching in through the firewall and HAproxy frontend to the backend:

    GET /.well-known/acme-challenge/WH7JUzQRpshMc5cvz3MDkHX3NBK-VozfezHARFQdViI HTTP/1.1
    Host: REDACTED.example.com
    User-Agent: Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)
    Accept: */*
    Accept-Encoding: gzip
    Connection: close
    
    

    And the response from my device:

    HTTP/1.1 200 OK
    
    WH7JUzQRpshMc5cvz3MDkHX3NBK-VozfezHARFQdViI.SWhpVHEbWXGHa_ckmbUPkkwK6AJ-328vsI-mQuZhiAE
    
    

    So the token is in place, as well as the keying mechanism. (Right? As far as I know the ACME protocol, things seem ok to this point.)

    –-

    I don't fully understand what is supposed to come next for the HTTP challenge model. From the logs it looks like the ACME client on my device is attempting to reach out to LE again but the challenge has moved from pending state to invalid, and rejects the ACME client's message with urn:acme:error:malformed.

    All of this process was working before the update to 2.4.2, though I never dug this deep into what was happening in the protocol since it was working.  :P
    Any thoughts? Thanks for any help.


  • Rebel Alliance Developer Netgate

    It's possible there is something broken there. I know the acme.sh project has changed how they handle standalone mode and we haven't updated to the latest version of the script yet. There are changes coming (or maybe already done) on the LE side where they verify updates in a way that isn't compatible with the current standalone method.


  • Rebel Alliance Developer Netgate

    I just synchronized us up to the latest acme.sh and I tested standalone on the new code and it worked. Update your ACME package to 0.1.30 and give it a shot.



  • Thanks for the input, jimp.

    I checked yesterday if the ACME package had an update and it didn't. I see a new commit in the pfsense/FreeBSD-ports repo – is this the update? Thanks for doing this work!

    I had done a few executions of the acme.sh script from the command line with the –debug 2 argument and noticed that it was complaining about nc and socat. I was going to look into that today.

    –-

    I just updated the ACME package to 2.7.5 and am running into the same timeout issue. When running with –debug 2, I'm seeing this towards the end of the execution:

    socat:
    socat by Gerhard Rieger and contributors - see www.dest-unreach.org
    Usage:
    socat [options] <bi-address><bi-address>options:
       ...</bi-address></bi-address> 
    

    Which makes me think that the invocation of socat is off in some way? This is similar output that the previous version of ACME had when trying to invoke nc.
    It still seems like LE reaches into my infrastructure, I reply with the proper nonce for the token, and then the ACME client polls the LE challenge a few times, waiting for the state to change from pending to something. The state eventually becomes invalid, and the ACME client fails.

    I suppose I can post in the acme.sh repo to see if Neilpang has any input since you've confirmed it's working on pfSense. Perhaps there is some file state on disk that is still causing it to fail on my end? I'll try with a fresh domain and confirmation process.

    Thanks jimp.


  • Rebel Alliance Developer Netgate

    Yes that is the correct commit you linked from FreeBSD-ports.

    Did you update the ACME package to 0.1.30 or did you manually change files or copy things from the acme.sh project? Our files are not straight copies of theirs for the bulk of the code, there are some platform-specific differences, so if you tried to use the acme.sh code directly it may not work properly.

    I'm not sure what might be happening with your setup but I don't run acme.sh by hand, I let the package invoke everything and it all works properly. Unless you carefully setup the environment, running it manually is doomed to fail. I wouldn't consider that a good/proper test.

    If you get a timeout, the #1 suspect is that you aren't allowing traffic to hit the firewall's WAN address on port 80.



  • Ah ha! I found the issue! jimp, was HAProxy updated with the upgrade to 2.4.2?

    Using the standalone method, I created a backend in the HAProxy UI with a single server bound to localhost on port 8082, with no health checks, or timeout/retry settings.
    (I didn't use health checks was because this backend is "down" whenever the acme.sh script isn't running. So I just ignored the overhead of doing a health check.)

    The problem here is there is an IPv4 and IPv6 address for localhost, so in the newest version of HAProxy, it actually created two servers even though the UI only had one specified:

    backend 0_HTTP_ACME_Standalone_http_ipvANY
    	mode			http
    	log			global
    	timeout connect		30000
    	timeout server		30000
    	retries			3
    	server			pfsense_0 127.0.0.1:8082  
    	server			pfsense_1 ::1:8082  
    
    

    The ACME client only binds to the IPv4 interface using socat:

    [2.4.2-RELEASE][root@pfsense]/tmp/acme: sockstat -l46 | grep 8082
    root     socat      96563 5  tcp4   *:8082                *:*
    
    

    (I ran this sockstat command during the execution of certificate creation/renewal and it only ever listens on 127.0.0.1:8082.)

    And the nail in the coffin is that with the None option specified in the load balancing section of the backend, it defaults to Round Robin:

    @HAProxy:

    The load balancing algorithm of a backend is set to roundrobin when no other
    algorithm, mode nor option have been set. The algorithm may only be set once
    for each backend.

    So LE was successfully reaching in to my infrastructure (as I noted with the packet capture) on the 127.0.0.1:8082 server, but when it attempted to reach in again, it would be Round Robin'd to the ::1:8082 server, to which the ACME client wasn't bound. This would then timeout and cause the validation process to fail.

    –-

    I've adjusted the backend to only listen on 127.0.0.1, not localhost, but I would be more satisfied if I knew the proper knobs to turn such that the HAProxy backend would timeout quickly, and a new request would be issued to the next server in the list. Alas, I will have to figure that out another day.

    Thanks for your help, jimp!

    –-

    PS - I actually tried to pull the HEAD of acme.sh's repo to execute on the pfSense box, but as soon as I saw the output I knew there had to be adjustments made to the source code that made it compatible with pfSense. I gave that up quickly.

    I also only ran it from the command line because the ACME UI actually prints out the full command of what it's executing under the covers. I thought it was safe to copy and paste it so I could have control over execution during debugging.  :)

    Thanks again


Log in to reply